Nephology of verbs
Introduction
This document provides a first description of the clouds of 12 Dutch verbs: herroepen, heffen, huldigen, haten, herhalen, herinneren, diskwalificeren, harden, herstellen, helpen, haken and herstructureren. In this introductory section, the parameter selection will be described, followed by a schema of the analysis workflow. Then we bring together the main conclusions from the individual analyses into a summary section, in an attempt to abstract away from the special cases of each lemma and move towards more generalizable insights. Finally, the analysis of each verb is described in its own section.
The central point is the summary; the descriptions in the section for each lemma give some more details and plots.
Parameters
The models under analysis are 2D representations of distance matrices between 200-320 tokens of each lemma, extracted from the QLVLNewsCorpus. Both the targets and the first and second order features are lemma/part-of-speech pairs, such as haak/verb, beslissing/noun, in/prep. The features (or context words) can have any part of speech except for punctuation and have a minimum frequency of 1 in 2 million (absolute frequency of 227) after discarding punctuation from the token count. There are 60533 such types in the corpus.
Three kinds of parameters were varied: first-order selection parameters, PPMI weighting and second-order selection parameters. While the first-order parameters influence which context features will be selected from the surroundings of the target tokens, the second-order parameters influence the shape of the vectors that represent such first-order features, and the PPMI weighting does both.
It would also have been possible to vary other parameters, such as the similarity function (we used cosine) and whether some dimensionality reduction is performed (it was not), but they were set to fixed values in this case.
Finally, we used two different techniques to reduce the distance matrices to 2 dimensions: nMDS (non-metric Multidimensional Scaling) and t-SNE (t-distributed Stochastic Neighbor Embedding). For the latter, we tried perplexity values of 10, 20, 30 and 50 (although for some models there were not enough tokens to use a perplexity of 50). While analyzing and comparing models, we will take into account what they look like in the different solutions.
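Since the token-level distances behind these matrices are cosine distances, a minimal sketch may help make the procedure concrete. This is only an illustration: the sparse-dict representation and the feature names below are assumptions, not the actual data structures of the workflow.

```python
import math

def cosine_distance(v, w):
    """Cosine distance between two sparse vectors (dicts of feature -> weight)."""
    dot = sum(v[f] * w[f] for f in set(v) & set(w))
    norm_v = math.sqrt(sum(x * x for x in v.values()))
    norm_w = math.sqrt(sum(x * x for x in w.values()))
    if norm_v == 0 or norm_w == 0:
        return 1.0  # treat an empty vector as maximally distant
    return 1 - dot / (norm_v * norm_w)

# Identical contexts are at distance 0; no shared features means distance 1.
print(cosine_distance({"glas/noun": 2.0}, {"glas/noun": 1.0, "hand/noun": 1.0}))
```

The full pipeline applies this to every pair of token-level vectors to obtain the distance matrices that nMDS and t-SNE then reduce to 2 dimensions.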
First-order selection parameters
The first set of parameters affects which context words will be selected from the context of each token for the first order vectors. In the current models, the main distinction is made by BASE: between bag-of-words (BOW) based and dependency-based models (LEMMAPATH and LEMMAREL). The former are further split by window size, part-of-speech filters and whether sentence boundaries are respected.
FOC-WIN (first order window)

- In BOW-based models, a symmetric window of 3, 5 or 10 tokens to each side of the target was used.
- In dependency models, no restrictions on window size were imposed.

FOC-POS (first order part-of-speech)

- In BOW-based models, a restriction can be placed to only select (common) nouns, adjectives, verbs and adverbs (`lex`) in the surroundings of the token. If no restriction is placed, the value of this parameter is `all`.
- In dependency models, no restrictions on part-of-speech were imposed, although `LEMMAREL` implies them.

BOUNDARIES

- In BOW-based models, `yes` or `bound` indicates that only features in the same sentence as the target were included, while `no` or `nobound` ignores sentence delimiters.
- Dependency models are, of course, always `yes`.

LEMMAPATH

- This set of dependency-based models selects the features that enter a syntactic relation with the target within a maximum number of steps. Features more than 3 steps away from the target were always excluded.1
- A one-step dependency path leads to the head of the target or to its direct dependent. Such features are included by both `selection2` and `selection3` and receive a weight of 1 in `weight`.
- A two-step dependency path leads to the head of the head of the target, to the dependent of its dependent, or to its sibling. Such features are included by both `selection2` and `selection3` and receive a weight of 2/3 in `weight`.
- A three-step dependency path leads to the head of the head of the head of the target, the sibling of the head of its head, the dependent of the dependent of its dependent, or the dependent of a sibling. An example of the last path is the subject of a passive construction with a modal, where the target is the verb in participle form. Such features are included in `selection3` but excluded from `selection2` and receive a weight of 1/3 in `weight`.

LEMMAREL

- This set of dependency-based models selects the features that enter a certain syntactic relation with the target. The paths were based on the top paths selected as cues (see Annotation) and, while they account for conjuncts and some modals, they don't include long chains of them.
- `group1` includes direct objects, active and passive subjects (with up to two modals for the active one), reflexive complements and prepositions depending directly on the target. `group2` expands `group1` with conjuncts of the verb, complementizers, nouns depending through a preposition, and verbal complements or elements of which the target is a verbal complement.
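The LEMMAPATH inclusion and weighting scheme can be sketched as a small function. This is a hypothetical helper, not the actual implementation; the variant names follow the model names used in the text.

```python
def lemmapath_weight(steps, variant="weight"):
    """Weight of a feature at a given dependency distance from the target.

    selection2 keeps paths of up to 2 steps and selection3 up to 3 steps
    (both unweighted); weight keeps up to 3 steps with weight (4 - steps) / 3,
    i.e. 1, 2/3 and 1/3. Beyond 3 steps, features are always excluded.
    """
    if steps > 3:
        return 0.0
    if variant == "selection2":
        return 1.0 if steps <= 2 else 0.0
    if variant == "selection3":
        return 1.0
    return (4 - steps) / 3
```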
PPMI weighting
The PPMI parameter is taken outside the set of first-order parameters because it can both filter out first-order features and reshape their vector representations. In truth, the choice of positive pointwise mutual information (PPMI) over other weighting mechanisms, as well as setting a threshold or not, is already a parameter setting, which in these circumstances is set to PPMI and a threshold of 0. In all cases, the PPMI was calculated based on a 4-4 window (that could also be a variable parameter).
This parameter can take three values: `no`, `selection` and `weight`. Both `selection` and `weight` mean that only the first-order features with a PPMI > 0 with the target type are selected, and the rest are discarded, while `no` does not apply the filter. The difference is that `selection` only uses the value to filter the context features, while `weight` also weighs their vectors with that value.
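As a hedged illustration of these values, the snippet below computes PPMI from raw counts and applies the selection and weighting steps. All counts here are invented for the example; the PPMI value for glas echoes the one reported later for heffen, purely for illustration.

```python
import math

def ppmi(cooc, freq_target, freq_feature, corpus_size):
    """Positive pointwise mutual information from raw counts."""
    if cooc == 0:
        return 0.0
    ratio = (cooc / corpus_size) / (
        (freq_target / corpus_size) * (freq_feature / corpus_size))
    return max(0.0, math.log(ratio))

# PPMI:selection keeps the features whose PPMI with the target type is > 0;
# PPMI:weight additionally multiplies their first-order values by that PPMI.
foc_counts = {"glas/noun": 40, "de/det": 500}   # invented co-occurrence counts
ppmi_vals = {"glas/noun": 5.66, "de/det": 0.0}  # illustrative PPMI values
selected = {f: c for f, c in foc_counts.items() if ppmi_vals[f] > 0}
weighted = {f: c * ppmi_vals[f] for f, c in selected.items()}
print(selected, weighted)
```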
Second-order selection
The selection of second-order features influences the shape of the vectors, i.e. how the selected first-order features are represented. While the frequency transformation and the window on which such values were computed could have been varied, they were set to fixed values, namely PPMI and 4-4 respectively. The parameters that were varied, although we don't expect drastic differences between the models, are vector length and part-of-speech.
SOC-POS (second order part-of-speech)

- This parameter can take two values: `nav` and `all`. In the former case, a selection of 13771 lect-neutral nouns, adjectives and verbs made by Stefano is taken as the set of possible second-order features. In the latter, all lemmas with frequency above 227 and any part of speech are considered.

LENGTH

- Vector length is the number of second-order features and therefore the dimensionality of the matrices on which the distance matrices are based, although the amount is not all that changes. It is applied after filtering by part-of-speech.
- We have selected two values: `5000` and `FOC`. The former includes the 5000 most frequent elements of the possible features, while the latter takes the intersection between the possible second-order features and the first-order features, regardless of frequency. With `SOC-POS:all`, `FOC` will include all first-order features of that model, while with `SOC-POS:nav`, only those included in Stefano's selection.
- The actual number of dimensions resulting from `FOC` depends on the strictness of the first order filter. This information can be found in the plots that, for each lemma, show how many first order context words are left after each combination of first order filters.
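The interaction between LENGTH and the candidate set can be sketched as follows. The function and the toy data are hypothetical; they only mirror the selection logic described above.

```python
def select_soc(foc, candidates, freqs, length="FOC", n=5000):
    """Choose the second-order dimensions.

    candidates: the types allowed by SOC-POS (all types, or the nav list).
    LENGTH:5000 keeps the n most frequent candidates; LENGTH:FOC keeps the
    candidates that are also first-order context words of the model.
    """
    if length == "FOC":
        foc_set = set(foc)
        return [t for t in candidates if t in foc_set]
    return sorted(candidates, key=lambda t: freqs[t], reverse=True)[:n]

# Toy example: with LENGTH:FOC the dimensionality follows the first-order filter.
candidates = ["glas/noun", "mooi/adj", "heffen/verb"]
freqs = {"glas/noun": 300, "mooi/adj": 900, "heffen/verb": 250}
print(select_soc(["glas/noun"], candidates, freqs))           # intersection
print(select_soc([], candidates, freqs, length="5000", n=2))  # by frequency
```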
FOC as SOC
What does it mean to use the same first-order context words as second-order context words?
First, depending on the number of target tokens and the strictness of the filter, there could be a different number of context words, ranging in the hundreds or low thousands.
Second, the context words will be compared based on their co-occurrence with each other. The behaviour of a context word outside the context of the target will be largely ignored: the association strength between two items does depend on their co-occurrence (and non-co-occurrence) across the whole corpus, but it will only be included in the second order vector of the first item if the second is also among the first order context words.
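This restriction can be made explicit with a toy example (the co-occurrence dictionary and the types are invented): with LENGTH:FOC, a first-order context word is represented only by its co-occurrence with the other selected context words, so its behaviour with any other type is dropped.

```python
def second_order_vector(word, cooc, dims):
    """Represent a first-order context word by its co-occurrence with the
    selected dimensions only; co-occurrence with anything else is dropped."""
    return {d: cooc.get((word, d), 0) for d in dims if d != word}

# Toy example: wijn/noun co-occurs often with glas/noun, but since it is not
# a first-order context word itself, that information never enters the vector.
foc = ["glas/noun", "hand/noun"]
cooc = {
    ("glas/noun", "hand/noun"): 3,
    ("glas/noun", "wijn/noun"): 12,
}
print(second_order_vector("glas/noun", cooc, foc))
```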
Steps in the analysis
For each of the lemmas, the analysis workflow involved:
- observing and describing the cloud of models, to find patterns in the qualitative effect of the parameters;
- corroborating the patterns with v-measures on hierarchical clustering solutions;
- selecting different sets of models to compare, based on the stronger parameters;
- observing and describing the models to compare;
- computing separability indices and v-measures on the models to find “best models” based on the senses.
In the first step, level 1 of the visualization tool was used to visually assess how the different parameters grouped the models of each lemma. Five main patterns were identified, described in the visual examination subsection.
In the second step, we clustered the models based on different hierarchical clustering methods and compared the resulting clusters with a classification based on the parameters values to assess the agreement between the clustering and the parameters. The process and results are described in the subsection “Based on v-measures”.
The description of each lemma is split in three parts: “Strength of parameters”, “First order filters” and “Notes on the clouds”. The first section shows the cloud of models and highlights which parameters are the strongest. In the second section, the quantitative effect of the first order filters is illustrated (i.e. on the number of tokens, the total number of first order context words and the number of first order context words per token).
Finally, the third section summarizes steps 3 through 5 of the workflow, with generalizing notes on the visual comparison between models (including some comments on the distances), plots of the best models based on separability indices and v-measures, and lastly plots showing the values of the separability indices.
Summary
Classification of verbs based on parameters
Based on visual examination
As a first step in the analysis, the level 1 clouds of all the verbs were examined to see which parameters grouped models together in the nMDS reduction. For each level, two kinds of distance matrices were created from a procrustes analysis of the token-level matrices: one with the original data, and one with a log transformation (log(1 + log(rank)), rank being the row-wise ranks of the original distance matrix). While this gave slightly different structures for each lemma (except for haken, where they were identical), the strongest parameters usually stayed the same. The discussion below pertains to the procrustes based on the transformed matrices but, unless specified, it also applies to the non-transformed ones.
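The rank transformation can be sketched for a single row of a distance matrix. This assumes rank 1 for the smallest distance and no ties; the actual implementation may handle these differently.

```python
import math

def log_rank_transform(row):
    """Apply log(1 + log(rank)) to one row of a distance matrix,
    with rank 1 assigned to the smallest distance (no ties assumed)."""
    order = sorted(range(len(row)), key=lambda i: row[i])
    out = [0.0] * len(row)
    for rank, i in enumerate(order, start=1):
        out[i] = math.log(1 + math.log(rank))
    return out

# The transformation compresses large distances while preserving their order.
print(log_rank_transform([0.2, 0.5, 0.1]))
```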
There are five main situations that could be identified in the different lemmas and that seemed relevant. The groups, diagrammed in Figure 1, are:
- FP (middle rows): There is a main split based on `FOC-POS`, with `FOC-POS:lex` separated from the rest and `FOC-POS:all` between it and the dependency-based models (or at least closer to the dependency-based models).
- FW (bottom rows): There is a sequential split based on `FOC-WIN`, with `FOC-WIN:5` between `FOC-WIN:3` and `FOC-WIN:10`. In one case the dependency-based models are on the far side of `FOC-WIN:3` (herstructureren), in another they are on the side of `FOC-WIN:10` (heffen), but in the rest the `BASE` division is transversal to the `FOC-WIN` division.
- P (middle columns): There is some sort of organization based on `PPMI`. It isn't very strict because, except for herhalen, the distinction between models with different `PPMI` values is not so clear.
- LSP (right columns): The models with `LENGTH:5000 + SOC-POS:all` are grouped together and separated from the rest. This can crosscut one of the other divisions or even be the strongest split.
- LR1 (black text instead of gray): The models with `LEMMAREL:group1` are clustered together independently of the rest of the divisions (sometimes with the LSP group).
Figure 1. Verbs organized by most important parameters.
Based on v-measures
In order to objectify the previous description of the effect of parameters, v-measures were calculated to estimate the variance explained by each parameter. Concretely, for each lemma a hierarchical cluster analysis was performed, and its results were evaluated against the values of each parameter. High values mean that the clustering solution agrees with grouping based on a certain parameter. A number of clustering methods were used (“ward.D”, “single”, “complete”, “average”, “mcquitty”, and “ward.D2” in the hclust() function)2; instead of looking for “high” values (which varied greatly across lemmas, parameters and clustering methods), we extracted the parameters with the highest 4 (out of 9) values based on at least 5 methods. While the quality of the match between the clustering solution and the parameter may vary, the idea is to extract the most important ones.
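For reference, the v-measure itself can be written out as the harmonic mean of homogeneity and completeness. The actual analysis was run in R around hclust(); this self-contained Python sketch is only meant to make the measure explicit.

```python
import math
from collections import Counter

def entropy(labels):
    n = len(labels)
    return -sum((c / n) * math.log(c / n) for c in Counter(labels).values())

def mutual_information(a, b):
    n = len(a)
    ca, cb = Counter(a), Counter(b)
    joint = Counter(zip(a, b))
    return sum((c / n) * math.log(c * n / (ca[x] * cb[y]))
               for (x, y), c in joint.items())

def v_measure(classes, clusters):
    """Harmonic mean of homogeneity (MI / H(classes)) and completeness
    (MI / H(clusters)); 1 means the clustering mirrors the classification."""
    h_c, h_k = entropy(classes), entropy(clusters)
    mi = mutual_information(classes, clusters)
    hom = 1.0 if h_c == 0 else mi / h_c
    com = 1.0 if h_k == 0 else mi / h_k
    return 0.0 if hom + com == 0 else 2 * hom * com / (hom + com)

# A clustering that matches a parameter's values perfectly scores (near) 1.
print(v_measure(["bow", "bow", "dep", "dep"], [1, 1, 2, 2]))
```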
This workflow was applied both to the nMDS solutions and to the procrustes distance matrices themselves. The most important parameters are FOC-WIN, FOC-POS, BASE + BOUNDARIES (division between BOW-based with boundaries, BOW-based without boundaries and dependency-based) and SOC-VECTOR (interaction between LENGTH and SOC-POS, quite close to the LSP group above). The comparison between the resulting groupings is shown in Table 1.
| criterion | visually | vmeasures_nMDS | vmeasures_proc |
|---|---|---|---|
| FOC-WIN | diskwalificeren; haken; herinneren; harden; heffen; herstructureren | blik; diskwalificeren; harden; herinneren; herroepen; herstellen; herstructureren; schaal; spoor; spot; staal; stof | diskwalificeren; haken; harden; heffen; helpen; herinneren; herroepen; hoop; huldigen; schaal; spoor; staal; stof |
| FOC-POS | haken; haten; diskwalificeren; herinneren; herstructureren; herhalen; herstellen; helpen | blik; diskwalificeren; haten; herinneren; herinneren; herstructureren; hoop; huldigen; spoor | blik; blik; haken; harden; haten; herstructureren; hoop; spoor; spot |
| SOC-VECTOR | herroepen; heffen; helpen; herhalen; herstructureren; harden; haten; herstellen | haken; heffen; helpen; herroepen; herstellen; hoop; horde; schaal; spot; staal; stof | helpen; herhalen; herroepen; horde; huldigen; schaal; staal; stof |
| BASE+BOUND | NA | harden; haten; heffen; herhalen; herroepen; horde; stof | harden; heffen; herinneren; herroepen; herstellen; herstructureren; horde; huldigen; schaal; staal; stof |
| Others | PPMI: huldigen; herinneren; herhalen | haten (foc_base); huldigen (foc_ppmi) | herroepen (foc_base); herstructureren (foc_base) |
A large number of groupings (in bold in the table) are “confirmed” by both v-measure calculations. However, this relationship should not be taken at face value. First, it must be noted that only the ranking of the v-measure values was taken into account, so the variance explained by each parameter might still be low; in very few cases do the v-measure values go above 0.1. Moreover, the patterns caught by the visual inspection don't match perfectly what the v-measures assess.
- When it comes to `FOC-WIN`, v-measures will expect to match four clusters to the three `FOC-WIN` values plus the dependency-based models. FW is more abstract and more specific at the same time: more abstract because it can allow for different degrees of separability, and for `FOC-WIN` and `BASE` to use different dimensions (i.e. splitting the plot orthogonally); more specific because it requires the `FOC-WIN` organization to be sequential, with `FOC-WIN:5` clearly between the other two.
- In the case of `FOC-POS`, v-measures will expect to match three clusters to the two `FOC-POS` values plus the dependency-based models, while FP rather captures a contrast between `FOC-POS:lex` on the one hand and `FOC-POS:all|NA` on the other, allowing for different degrees of overlap between the latter two. A better matching division was also tested with the v-measures, which contributed one lemma in each case (based on 2D or procrustes).
- V-measures will also assess `SOC-VECTOR` by how well the four main clusters match its four values, while LSP only requires `LENGTH:5000 + SOC-POS:all` to be separated from the rest. A better matching division was also tested with the v-measures, but it was not high enough in the rankings.
What do the clouds say?
The intuition behind Distributional Semantics is that you can use the context of an item to describe its meaning, which pairs nicely with the Cognitive Linguistics match of meaning and use. For these case studies, lexicographical meanings (or rather, simplified versions of them) were used to classify tokens, in order to compare vector space models with such a classification.
Based on the analysis of the 12 verbs described in this document, vector space models do not model lexicographical meanings. This doesn’t mean that they cannot model lexicographical meanings, but that vector space models do something else and, only to the degree that lexicographical meanings correlate with that something, can vector space models model them.
If not lexicographical meanings, what do vector space models model?
The short answer is: context (which is not unexpected, but also not informative). Vector space models seem to model the strength and clarity of collocations (and collostructions?).
The long answer goes as follows. The verbs described below exhibit different numbers of senses, with different frequency distributions. However, these do not seem to have a great effect on the models' ability to disambiguate them. For example, four verbs (heffen, huldigen, haten and herroepen) have two senses: in the first two, one of the two senses is at least twice as frequent as the other, while in the last two, the senses are almost equally frequent. They are all affected differently by the parameters, and while heffen and herroepen are among the verbs with the best separability, haten is one of the worst. Furthermore, in harden the most frequent sense is more compact and organized, while in helpen less frequent senses are better modeled than more frequent ones.
The grouping of tokens, as distinct areas or even pockets of tokens that flock together, is rather driven by frequent and/or strong collocates, which can be either lexical or functional (characterizing certain constructions, i.e. syntactic behaviour). This is based on visual inspection of the clouds and on the most frequent words shared by different groups of tokens, but more specific analyses that operationalize these insights more efficiently should be performed. The concrete ways in which this takes place are:
- There are frequent lexical collocates, with relatively high PPMI, that group tokens together. These are picked up by all models; `FOC-POS:lex` models are not necessarily better, and `PPMI` may have a stronger influence.
  - In heffen, these are belasting for heffen_2 and glas, hand, arm, hemel and vinger for heffen_1. Because their PPMIs are so high, the clusters are well separated. Because glas is so different from the rest of the concrete objects, the split is not only between senses but also between glas and the rest of the concrete tokens.
  - In harden, these are stank and pijn in some models and meer in others, always for harden_5: they split it and don't fully cover it.
  - In diskwalificeren, these are vals, start and meter, for diskwalificeren_2: they split it (between tokens that have them and those that don't) and don't fully cover it.
  - In haken, these are strafschop[gebied] and pootje, for haken_3 (they split it). Furthermore, context words in the semantic domain of hobbies (?) group haken_5 together.
  - In helpen, the only one is zeep (which has “fixed” as part-of-speech, so it's not really considered by `FOC-POS:lex` models), and it disambiguates an idiomatic expression.
  - In herroepen, these are beslissing for herroepen_1 and verklaring and uitspraak for herroepen_2. Furthermore, the uitspraak (herroepen_2) tokens are always close to herroepen_1 tokens that co-occur with context words from the legal semantic field.
  - In herhalen, geschiedenis for herhalen_3 (it may split it and does not cover it).
  - In huldigen, these are principe, standpunt and opvatting, all for huldigen_2; they may split it.
  - In herstructureren, bedrijf excludes herstructureren_1 and makes the other two overlap.
  - In herstellen, these are the multiword expression in ere, and evenwicht, for herstellen_2; they split it and do not cover it. Financial topics seem to group herstellen_4 and herstellen_6.
  - This pattern is absent from herinneren and haten.
- There are frequent constructions (grammatical elements or combinations thereof) that group tokens together, normally better in `LEMMAPATH` models and never in `FOC-POS:lex` ones:
  - In harden, niet + te groups harden_5; it is also part of its definition.
  - In diskwalificeren, zich groups diskwalificeren_3; it is also part of the definition.
  - In haken, naar groups haken_6, en groups haken_5 and the combination of in and elkaar groups haken_1 and haken_2. Only the first behaviour is definitional.
  - In helpen, aan groups helpen_5 and the combination of niet and het groups helpen_3. Only the first behaviour is definitional.
  - In herhalen, zich groups herhalen_3 (it is also part of its definition) and dat groups herhalen_2.
  - In herinneren, (er)aan groups herinneren_1 and [reflexive] pronouns group herinneren_2, which is part of their definitions. However, models also tend to distinguish between first (ik + me) and third (zich) person for herinneren_2, adding an extra split.
  - In haten, the combination of ik and het groups mostly haten_1 tokens (in a sense, it is definitional).
  - In herstructureren, the combination of om and te tends to exclude herstructureren_2 and make the other two overlap. This is not definitional.
  - In herstellen, the combination of zijn and van groups herstellen_5, which is not really part of the definition but might be deduced from it.
  - This pattern is absent from heffen, herroepen and huldigen.
- There are lexical collocates that serve a grammatical function, such as worden indicating a passive construction:
  - worden groups heffen_2 with heffen_1-glas, herhalen_1 with herhalen_4, herstructureren_1 with herstructureren_2, huldigen_1, and diskwalificeren_2.
  - blijven groups haken_4.
  - laat groups a number of herroepen_2 tokens.
The degree to which a model captures lexicographical senses is related to the degree to which the senses match these constructional/collocational patterns, whether as part of the definition or not. If the collocate is too strong (this occurs more with lexical collocates than with constructions), it may even split a sense between the tokens where it occurs and those where it doesn't. Some senses do not exhibit clear patterns and are therefore more scattered in the models, overlapping with others; in some cases the low frequency of the sense probably plays a role too, but if it were linked to a strong collocate it could still cluster.
- In harden, the only sense that groups together is harden_5.
- haken_1 does not group together (if anything, some of its tokens go with haken_2).
- helpen_1, helpen_2 and helpen_4 are not grouped.
- herinneren_3 is not grouped (it is also extremely infrequent).
- The senses of haten are not well disambiguated.
- herstellen_1 and herstellen_3 are not grouped.
In order to further investigate these observations, we should try to answer the following questions:
- Is there a generalizable predictor of the “power” of the collocates? The main suspects would be PPMI, frequency in the sample and similarity (rank) of the competition (e.g. for herroepen, the similarity (rank) between the type-level vectors of beslissing and verklaring).
- For example, is there a PPMI threshold that generates separate clusters? Or a combination of predictors?
- I would first take a manual look at the FOCs of each lemma to see the distribution of PPMI and frequency values and where the collocates stand, and, at least for the identified collocates, compute the nearest neighbours. Based on that, I might use statistical modelling.
- Would the same predictors work for lexical and functional words? (The latter would normally have lower PPMI.) In other words, is it enough to select models that include functional words and those that don’t in order to model constructions? In general, lexical and functional collocations seem to complement each other rather than compete.
Understanding this would not allow us to fit collocational and constructional patterns to senses, but to model the patterns themselves.
HEFFEN
The sample of heffen tokens consists of 218 tokens, with the following sense frequency: heffen_1: 78, heffen_2: 140.
Based on visual analysis of the cloud of models, heffen belongs to FW (sequential window structure, in this case with the dependency-based models beyond FOC-WIN:10) and LSP (LENGTH:5000 + SOC-POS:all separated). There is some PPMI structure, but only in the LSP group, and no clear grouping of FOC-POS:lex against the rest or special clustering of LEMMAREL:group1. The rankings based on v-measures, both from the nMDS coordinates and from the original procrustes matrix, suggest a division by BASE+BOUNDARIES; the former confirms a division based on SOC-VECTOR and the latter one based on FOC-WIN.
Strength of parameters
A first impression of the clouds relates to the stress values of the dimensionality reduction and the parameters that make the strongest distinctions between models. We have 204 models of heffen created on 09/07/2020, modeling between 189 and 218 tokens. The stress value of the MDS solution for the cloud of models is 0.177.
As can be seen in Figure 2, there is a progression from FOC-WIN:3 at the bottom towards FOC-WIN:10 and then the dependency based-models. The main split is given by LSP: the LENGTH:5000 + SOC-POS:all models are on the right half, and the rest to the left.
Figure 2. Cloud of models of ‘heffen’. Explore it here.
First order filters
Figure 3 and Figure 4 show the quantitative effect of the first order filters on the total number of FOCs and on the number of FOCs per token, respectively. The panels to the left show the data from the BOW-based models, while those on the right show the data for the dependency-based models.
Among the BOW-based models, tokens are lost in FOC-WIN:3 + FOC-POS:lex models, especially when both are combined with PPMI:yes; among the dependency-based models, only the LEMMAREL ones lose tokens, especially LEMMAREL:group1 + PPMI:yes (11.01%).
Figure 3. Total remaining and context words of ‘heffen’.
Figure 4. Remaining contextwords per token of ‘heffen’.
Notes on the clouds
The concordances of heffen seem to be characterized by strong lexical collocates: the nouns glas (PPMI = 5.66, while glaasje = 4.07)3, hand (3.06), arm (4.07), hemel (4.87), belasting (6.23, but also a number of compounds), tol (6.56) and accijns (6.21). The first four correspond to heffen_1 and the rest to heffen_2.
Another relatively important collocate is worden (1.30), which is not strong enough, compared to the others, to cluster tokens together in PPMI:weight models, but it does seem to occur in many heffen_2 tokens (complementing, rather than overlapping with, belasting) and in some glas tokens, which makes for an interesting typology.
On the one hand, the collocates seem to be strong and frequent enough that most models are very similar to each other, and the clustering is quite clear in all of them. On the other hand, it would seem that strong filters, and in particular PPMI:weight, reinforce the power of the lexical collocates, splitting heffen_1 in two, while noisier models moderate that power, joining the heffen_1 subclouds (glas versus the rest).
It could be argued that the metonymical nature of een glas heffen (and of de handen ten hemel heffen and een vinger(tje) heffen, for that matter) is grounds for a separation within what was considered the “concrete” sense of heffen. However, it is not yet clear whether that is what the models pick up on. Is the whole context also indicative of different situations? Is glas very different from hand, arm and vinger simply because of the different semantic field, or also because of the metonymical extensions they participate in? More research into other verbs with similar profiles would be needed to understand the phenomenon, but in principle it can be said that:
heffen is characterized by strong lexical collocates from different lexical fields, but their strength can be mitigated by noisy vectors, without distorting the topology of the clouds too much.
Visual comparison
In order to compare models along a defined set of parameters, the following groups of clouds were looked at side by side:
- Three groups of 8 models with `BOUNDARIES:yes + FOC-POS:lex|NA + LENGTH:FOC + SOC-POS:nav`, each with a different `PPMI`, in order to compare across different `FOC-WIN` and `BASE` but also assess the effect of `PPMI`.
- Three groups of 8 models with `BOUNDARIES:yes + FOC-POS:lex|NA + LENGTH:5000 + SOC-POS:all`, each with a different `PPMI`, in order to compare across different `FOC-WIN` and `BASE` but also assess the effect of `PPMI` and `LENGTH + SOC-POS`.
- Three groups of 6 models with `BOUNDARIES:yes + BASE:BOW + LENGTH:FOC + SOC-POS:nav`, each with a different `PPMI`, in order to compare across different `FOC-WIN` and `FOC-POS` within the BOW-based models but also assess the effect of `PPMI`.
- One group of 9 models with `LEMMAREL:group1 | LEMMAPATH + LENGTH:FOC + SOC-POS:nav` to compare across different dependency-based models but also assess the effect of `PPMI`.
The difference between BOUNDARIES values was not explored because the models seem to be very close to each other. For the LENGTH:FOC + SOC-POS:nav models, distances rarely go above 0.3, while for LENGTH:5000 + SOC-POS:all they go up to around 0.5 (which is still quite low). Values are higher for pairs of models with different BASE and for PPMI:no.
The heffen clouds tend to have very good separability, quite homogeneous in terms of senses but not complete. In the first set, even the nMDS solutions show distinct clusters, with greater distance in PPMI:weight and smaller in PPMI:no. One of the clusters corresponds to heffen_2 and the other two to heffen_1: one for the collocation with glas and one for the rest (mostly collocation with hand+hemel, arm and vinger). In models with PPMI:selection|weight, some smaller clusters for the co-occurrence with tol and accijns (heffen_2) are visible, especially in t-SNE models with low perplexity, as well as two entrée (heffen_2) tokens together and far from the rest of heffen_2.
FOC-WIN:10 + PPMI:no seems to decrease the distance between the heffen_1 subgroups (at least in t-SNE with perplexity 20) and both that combination and LEMMAPATH + PPMI:no pull the glas cluster towards heffen_2 because of the co-occurrence with worden.
These very evident clusters in LENGTH:FOC + SOC-POS:nav are not as clear in the nMDS solutions of LENGTH:5000 + SOC-POS:all unless PPMI:weight (which leads to two separate glas clusters in t-SNE with perplexity 10). While the senses are relatively grouped together, they are not as neat.
The rest of the groups don’t add significant insights.
Clustering and separability measures
One other way of selecting the “best” models, or of quantifying their quality, is to use separability indices (from the semvar package). The four main indices (“DR”, “SIL”, “SCP” with k=10, and “kNN” with k=10) have been computed and their values (global quality, quality by class and mean class quality) have been ranked. The 8 models whose values lie in the top 10% for all ranks are listed below. They are mostly t-SNE solutions with perplexity 50, LENGTH:FOC + FOC-WIN:10 + FOC-POS:lex + PPMI:selection|weight.
- BOWnobound10lex.PPMIweight.LENGTHFOC.SOCPOSnav (tsne.50)
- BOWbound10lex.PPMIweight.LENGTHFOC.SOCPOSnav (tsne.50)
- BOWbound10lex.PPMIselection.LENGTHFOC.SOCPOSnav (tsne.30)
- BOWnobound10lex.PPMIweight.LENGTHFOC.SOCPOSall (tsne.50)
- BOWbound10lex.PPMIselection.LENGTHFOC.SOCPOSnav (tsne.50)
- BOWnobound10all.PPMIweight.LENGTHFOC.SOCPOSall (tsne.50)
- BOWbound10lex.PPMIweight.LENGTHFOC.SOCPOSall (tsne.50)
- BOWbound10lex.PPMIselection.LENGTHFOC.SOCPOSall (tsne.50)
Of these models, the ones with the highest-ranking v-measures across the different methods are BOWbound10lex.PPMIweight.LENGTHFOC.SOCPOSall (tsne.50) and BOWnobound10all.PPMIweight.LENGTHFOC.SOCPOSall (tsne.50), illustrated in Figure 5.
Figure 5. Best models of ‘heffen’ according to separability indices and v-measures.
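The v-measure compares an induced clustering against the manual sense annotation. A minimal sketch with hypothetical data (KMeans is only a stand-in for whichever clustering method is scored):

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import v_measure_score

rng = np.random.default_rng(1)
# hypothetical 2D coordinates with two clearly separated senses
coords = np.vstack([rng.normal(0, 1, (60, 2)), rng.normal(6, 1, (60, 2))])
senses = [0] * 60 + [1] * 60  # hypothetical manual sense annotation

# cluster the 2D solution and score the match with the annotation
clusters = KMeans(n_clusters=2, n_init=10, random_state=1).fit_predict(coords)
vm = v_measure_score(senses, clusters)  # 1.0 = clusters match the senses
```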
The actual separability indices for these two best models are shown in Table 2; the first four value columns correspond to the first model and the last four to the second.
| level | DR | kNN | SCP | SIL | DR | kNN | SCP | SIL |
|---|---|---|---|---|---|---|---|---|
| globqual | 3.304 | 0.975 | 0.972 | 0.642 | 3.250 | 0.974 | 0.967 | 0.640 |
| meanclassqual | 3.386 | 0.973 | 0.972 | 0.655 | 3.304 | 0.973 | 0.966 | 0.648 |
| classqual | ||||||||
| heffen_1 | 3.675 | 0.966 | 0.972 | 0.700 | 3.494 | 0.970 | 0.961 | 0.675 |
| heffen_2 | 3.096 | 0.981 | 0.972 | 0.610 | 3.114 | 0.976 | 0.971 | 0.620 |
More on separability indices
While the “best” models were chosen based on a ranking, the range of the values is very different across lemmas. The curve of the different values by rank for heffen is shown in Figure 6.
The highest values of the local indices (“kNN” and “SCP”) are very high and decrease very slowly, while the global ones decrease a bit faster at the beginning and then quite slowly; the maximum Silhouette (“SIL”) values are a bit over 0.6, which is quite reasonable. Furthermore, the values are very similar for all senses and globally.
Figure 6. Separability indices of ‘heffen’ by rank for different measures and levels.
Figure 7. Separability indices of ‘heffen’ by harmonic mean of ranks for different measures and levels.
HARDEN
The sample of harden tokens consists of 279 tokens, with the following sense frequency: harden_1: 3, harden_2: 10, harden_3: 65, harden_4: 10, harden_5: 191.
Based on visual analysis of the cloud of models, harden belongs to FW (sequential window structure, in this case with FOC-WIN:3 farther from the rest and the dependency-based models in an orthogonal position), LSP (LENGTH:5000 + SOC-POS:all separated, although not as fully as in other cases), and LR1 (LEMMAREL:group1 is far away from LEMMAREL:group2, although still within the dependency-based area). There is no PPMI structure, nor a clear grouping of FOC-POS:lex against the rest. The rankings based on v-measures, both from the nMDS coordinates and from the original procrustes matrix, confirm only the FW grouping and suggest BASE+BOUNDARIES; the latter also confirms FOC-POS.
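The cloud of models rests on pairwise procrustes disparities between the 2D solutions of the individual models. A minimal sketch with hypothetical configurations, using SciPy's procrustes (which may differ in detail from the procedure actually used):

```python
import numpy as np
from scipy.spatial import procrustes

rng = np.random.default_rng(2)
# hypothetical 2D solutions of the same 50 tokens under three models
solutions = [rng.normal(size=(50, 2)) for _ in range(3)]

n = len(solutions)
disparity = np.zeros((n, n))
for i in range(n):
    for j in range(i + 1, n):
        # procrustes optimally translates, scales and rotates the second
        # configuration onto the first and returns the residual disparity
        _, _, d = procrustes(solutions[i], solutions[j])
        disparity[i, j] = disparity[j, i] = d
# an MDS of this disparity matrix yields a "cloud of models"
```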
Strength of parameters
A first impression of the clouds comes from the stress values of the dimensionality reduction and from the parameters that make the strongest distinctions between models. We have 204 models of harden, created on 13/07/2020, modeling between 244 and 279 tokens. The stress value of the MDS solution for the cloud of models is 0.158. The stress values of the MDS solutions of the individual models are not available (the computed range, Inf to -Inf, points to missing values).
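Stress values like the ones above can be reproduced with any MDS implementation. A minimal sketch computing Kruskal's stress-1 for a 2D solution of a hypothetical cosine distance matrix (a generic illustration, not the exact workflow used here):

```python
import numpy as np
from scipy.spatial.distance import pdist, squareform
from sklearn.manifold import MDS

rng = np.random.default_rng(3)
vectors = rng.normal(size=(50, 10))            # hypothetical token vectors
dist = squareform(pdist(vectors, metric="cosine"))

# 2D metric MDS on the precomputed distance matrix
mds = MDS(n_components=2, dissimilarity="precomputed", random_state=3)
coords = mds.fit_transform(dist)

# Kruskal's stress-1: mismatch between input and embedded distances
emb = squareform(pdist(coords))
stress1 = np.sqrt(((dist - emb) ** 2).sum() / (dist ** 2).sum())
```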
As can be seen in Figure 8, the cloud takes the shape of a table or bridge, with the right leg made of FOC-WIN:3 tokens, the left one made of other BOW-based tokens (FOC-WIN:5 more to the right) and the dependency-based models at the top. SOC-POS:all and LENGTH:5000 tend to go towards the bottom.
Figure 8. Cloud of models of ‘harden’. Explore it here.
First order filters
Figure 9 and Figure 10 show the quantitative effect of the first order filters on the total number of FOCs and per token, respectively. The panels on the left show the data for the BOW-based models, while those on the right show the data for the dependency-based models.
Very few tokens are lost in BOW-based models, mainly by PPMI:yes, FOC-POS:lex, FOC-WIN:3 and BOUNDARIES:yes, but at most 8 out of 279. LEMMAREL, on the other hand, loses many more tokens, especially LEMMAREL:group1, and even more in combination with PPMI:yes (11.47%). These models also keep a very low number of first-order context words: fewer than 300 for LEMMAREL and between 200 and 500 for LEMMAPATH, similar to small windows with a FOC-POS or PPMI filter (but with a seemingly more efficient distribution: the same number of context words per token for a lower total number of context words).
Figure 9. Total remaining and context words of ‘harden’.
Figure 10. Remaining context words per token of ‘harden’.
Notes on the clouds
The concordance of harden is mostly characterized by the asymmetric (zipfian) distribution of the senses and the stable construction that matches the most frequent one. Some internal structure of that sense (such as strong lexical collocations) can emerge, but that of the rest of the tokens is too weak in comparison.
Visual comparison
In order to compare models along a defined set of parameters, the following groups of clouds were looked at side by side:
- Three groups of 8 models with BOUNDARIES:yes + FOC-POS:lex|NA + LENGTH:FOC + SOC-POS:nav, each with a different PPMI, in order to compare across different FOC-WIN and BASE but also assess the effect of PPMI.
- Three groups of 8 models with BOUNDARIES:yes + FOC-POS:lex|NA + LENGTH:5000 + SOC-POS:all, each with a different PPMI, in order to compare across different FOC-WIN and BASE but also assess the effect of PPMI and LENGTH+SOC-POS.
- Three groups of 6 models with BOUNDARIES:yes + BASE:BOW + LENGTH:FOC + SOC-POS:nav, each with a different PPMI, in order to compare across different FOC-WIN and FOC-POS within the BOW-based models but also assess the effect of PPMI.
- One group of 9 models with LEMMAREL:group1 | LEMMAPATH + LENGTH:FOC + SOC-POS:nav to compare across different dependency-based models but also assess the effect of PPMI.
The differences between BOUNDARIES were not explored because the models seem to be very close to each other. Distances only go below 0.2 for dependency-based models with the same template or for BOW-based models with the same FOC-WIN and PPMI:selection|weight. They can go up to 0.8 or 0.9, especially between models where one is FOC-WIN:3 or LEMMAREL.
The main characteristic of the harden clouds is the presence of a very large and very idiomatic sense, namely harden_5 (‘niet te harden’). It makes for a dense core in the nMDS solutions and for a very strong, well-formed cluster in LEMMAREL:group1 t-SNE solutions, even with low perplexity and with any SOC-VECTOR. In other models, this sense is split by other means: FOC-WIN:3 and LEMMAPATH:selection2 distinguish between the presence and absence of meer (“niet [meer] te harden”), while the rest group tokens based on frequent objects, mainly pijn and stank. The latter is clearer in LENGTH:FOC + SOC-POS:nav, and appears regardless of PPMI.
There is no clear structure for the rest of the senses, except for a relatively low overlap with harden_5 in t-SNE solutions.
Clustering and separability measures
One other way of selecting the “best” models, or of quantifying their quality, is to use separability indices (from the semvar package). The four main indices (“DR”, “SIL”, “SCP” with k=10, and “kNN” with k=10) have been computed and their values (global quality, quality by class and mean class quality) have been ranked. The top 10% for harmonic mean of ranks includes 33 solutions; there is no overlap between the top 10% of the different measures.
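The harmonic-mean-of-ranks selection can be sketched as follows (the index values are hypothetical; rank 1 is best and the model with the lowest combined rank wins):

```python
import numpy as np
from scipy.stats import hmean, rankdata

# hypothetical index values for five models (higher = better)
scores = {
    "SIL": [0.60, 0.10, 0.40, 0.50, 0.20],
    "kNN": [0.90, 0.70, 0.95, 0.80, 0.60],
}
# rank 1 = best value for each index
ranks = np.array([rankdata(-np.array(vals)) for vals in scores.values()])
combined = hmean(ranks, axis=0)  # harmonic mean of ranks per model
best = int(np.argmin(combined))  # lowest combined rank wins
```

The harmonic mean rewards models that rank well on every index at once, rather than excelling on one and failing on another.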
Of these models, the ones with the highest-ranking v-measures across the different methods are BOWnobound3all.PPMIselection.LENGTH5000.SOCPOSnav (mds) and LEMMAPATHselection3.PPMIweight.LENGTH5000.SOCPOSnav (tsne.10), illustrated in Figure 11.
Figure 11. Best models of ‘harden’ according to separability indices and v-measures.
The actual separability indices for these two best models are shown in Table 3; the first four value columns correspond to the first model and the last four to the second.
| level | DR | kNN | SCP | SIL | DR | kNN | SCP | SIL |
|---|---|---|---|---|---|---|---|---|
| globqual | 2.887 | 0.925 | 0.922 | 0.510 | 1.253 | 0.922 | 0.912 | 0.053 |
| meanclassqual | 2.263 | 0.887 | 0.872 | 0.332 | 1.608 | 0.892 | 0.869 | 0.218 |
| classqual | ||||||||
| harden_3 | 0.995 | 0.809 | 0.771 | -0.031 | 2.320 | 0.832 | 0.783 | 0.551 |
| harden_5 | 3.531 | 0.965 | 0.974 | 0.694 | 0.896 | 0.952 | 0.955 | -0.114 |
More on separability indices
While the “best” models were chosen based on a ranking, the range of the values is very different across lemmas. The curve of the different values by rank for harden is shown in Figure 12.
The local values are relatively high, lower for harden_3 than for harden_5, and decrease slowly, while the global values decrease much faster (especially harden_5, which starts quite high).
Figure 12. Separability indices of ‘harden’ by rank for different measures and levels.
Figure 13. Separability indices of ‘harden’ by harmonic mean of ranks for different measures and levels.
DISKWALIFICEREN
The sample of diskwalificeren tokens consists of 238 tokens, with the following sense frequency: diskwalificeren_1: 68, diskwalificeren_2: 148, diskwalificeren_3: 22.
Based on visual analysis of the cloud of models, diskwalificeren belongs to FW (sequential window structure, in this case with some overlap between them and the dependency-based group in an orthogonal position), FP (FOC-POS:lex separated from the rest, with FOC-POS:all between it and the dependency-based models), and LR1 (LEMMAREL:group1 is separated from the rest of the models). There is no PPMI or SOC-VECTOR structure. The rankings based on v-measures, both from the nMDS coordinates and from the original procrustes matrix, confirm the FW grouping, while the former confirms the FP grouping.
Strength of parameters
A first impression of the clouds comes from the stress values of the dimensionality reduction and from the parameters that make the strongest distinctions between models. We have 204 models of diskwalificeren, created on 13/07/2020, modeling between 198 and 238 tokens. The stress value of the MDS solution for the cloud of models is 0.201. The stress values of the MDS solutions of the individual models are not available (the computed range, Inf to -Inf, points to missing values).
As can be seen in Figure 14, the main cloud is divided by FOC-POS and, less clearly, by FOC-WIN; the rogue cloud to the right is made of the LEMMAREL:group1 models.
Figure 14. Cloud of models of ‘diskwalificeren’. Explore it here.
First order filters
Figure 15 and Figure 16 show the quantitative effect of the first order filters on the total number of FOCs and per token, respectively. The panels on the left show the data for the BOW-based models, while those on the right show the data for the dependency-based models.
Tokens are lost by all models except the BOW-based models without any filters and the LEMMAPATH|LEMMAREL:group2 + PPMI:no models, in spite of the low number of first-order context words in the latter group. All filters contribute to token loss, reaching 13.45% in the most extreme case. LEMMAREL:group1 also loses tokens, about as many as FOC-WIN:3 + FOC-POS:nav, even though it has fewer first-order context words.
Figure 15. Total remaining and context words of ‘diskwalificeren’.
Figure 16. Remaining context words per token of ‘diskwalificeren’.
Notes on the clouds
The concordance of diskwalificeren is characterized by three main senses: two transitive and frequent ones that apply to different situations (one in sports, the other elsewhere), and a reflexive one that applies to the second group of situations. The models tend to show larger groups for the more contextually defined sense and a relationship of inclusion between the less frequent, reflexive sense and its more frequent, transitive counterpart.
Visual comparison
In order to compare models along a defined set of parameters, the following groups of clouds were looked at side by side:
- Three groups of 8 models with BOUNDARIES:yes + FOC-POS:lex|NA + LENGTH:FOC + SOC-POS:nav, each with a different PPMI, in order to compare across different FOC-WIN and BASE but also assess the effect of PPMI.
Distances between these models range between 0.2 and 0.8, in the upper half for those with different BASE and in the lowest third for LEMMAPATH pairs of models, with slightly higher values between PPMI:no models than between PPMI:weight models.
While the overlap between diskwalificeren_2 (the “sports” sense) and the rest varies depending on the model and solution, diskwalificeren_3 (reflexive, semantically similar to diskwalificeren_1) consistently appears included in the diskwalificeren_1 area, even if tightly clustered (sometimes split in two based on the opposition between zich and zichzelf). In a number of models, diskwalificeren_2 has two main clouds: one in which vals, start and meter tend to co-occur, and one in which they don’t. A frequent collocate is worden, which can pull tokens from both main senses together.
Clustering and separability measures
One other way of selecting the “best” models, or of quantifying their quality, is to use separability indices (from the semvar package). The four main indices (“DR”, “SIL”, “SCP” with k=10, and “kNN” with k=10) have been computed and their values (global quality, quality by class and mean class quality) have been ranked. The top 10% for harmonic mean of ranks includes 27 solutions; there is no overlap between the top 10% of the different measures.
Of these models, the ones with the highest-ranking v-measures across the different methods are LEMMAPATHweight.PPMIselection.LENGTHFOC.SOCPOSall (tsne.30) and LEMMAPATHweight.PPMIselection.LENGTHFOC.SOCPOSnav (tsne.50), illustrated in Figure 17.
Figure 17. Best models of ‘diskwalificeren’ according to separability indices and v-measures.
The actual separability indices for these two best models are shown in Table 4; the first four value columns correspond to the first model and the last four to the second.
| level | DR | kNN | SCP | SIL | DR | kNN | SCP | SIL |
|---|---|---|---|---|---|---|---|---|
| globqual | 1.822 | 0.771 | 0.795 | 0.150 | 1.886 | 0.775 | 0.784 | 0.187 |
| meanclassqual | 2.445 | 0.718 | 0.766 | 0.191 | 2.237 | 0.710 | 0.714 | 0.181 |
| classqual | ||||||||
| diskwalificeren_1 | 1.436 | 0.576 | 0.567 | -0.336 | 1.600 | 0.570 | 0.539 | -0.278 |
| diskwalificeren_2 | 1.607 | 0.872 | 0.893 | 0.303 | 1.790 | 0.887 | 0.913 | 0.358 |
| diskwalificeren_3 | 4.292 | 0.704 | 0.837 | 0.607 | 3.319 | 0.671 | 0.689 | 0.464 |
More on separability indices
While the “best” models were chosen based on a ranking, the range of the values is very different across lemmas. The curve of the different values by rank for diskwalificeren is shown in Figure 18.
The highest local values are quite high for diskwalificeren_2, a bit lower (around 0.75) for diskwalificeren_3 and lowest for diskwalificeren_1, but they decrease quite rapidly and drastically for diskwalificeren_3. The silhouette is lower but has a similar shape, decreasing slowly for all but diskwalificeren_3.
Figure 18. Separability indices of ‘diskwalificeren’ by rank for different measures and levels.
Figure 19. Separability indices of ‘diskwalificeren’ by harmonic mean of ranks for different measures and levels.
HAKEN
The sample of haken tokens consists of 251 tokens, with the following sense frequency: haken_1: 35, haken_2: 93, haken_3: 65, haken_4: 26, haken_5: 14, haken_6: 18.
Based on visual analysis of the cloud of models, haken belongs to FW (sequential window structure, in this case in a radial shape, with some overlap between them and the dependency-based group in an orthogonal position), FP (FOC-POS:lex separated from the rest, with FOC-POS:all between it and the dependency-based models), and LR1 (LEMMAREL:group1 is separated from the rest of the models). There is no PPMI or SOC-VECTOR structure. The ranking based on v-measures from the original procrustes matrix confirms the FW and FP groupings and also suggests SOC-VECTOR.
Strength of parameters
A first impression of the clouds comes from the stress values of the dimensionality reduction and from the parameters that make the strongest distinctions between models. We have 204 models of haken, created on 13/07/2020, modeling between 222 and 251 tokens. The stress value of the MDS solution for the cloud of models is 0.14. The stress values of the MDS solutions of the individual models are not available (the computed range, Inf to -Inf, points to missing values).
As can be seen in Figure 20, the main division is by FOC-POS, with a weaker one by FOC-WIN; the rightmost group is made of LEMMAREL:group1 tokens.
Figure 20. Cloud of models of ‘haken’. Explore it here.
First order filters
Figure 21 and Figure 22 show the quantitative effect of the first order filters on the total number of FOCs and per token, respectively. The panels on the left show the data for the BOW-based models, while those on the right show the data for the dependency-based models.
No tokens are lost by FOC-POS:all and LEMMAPATH models (except for LEMMAPATH:selection2 + PPMI:yes); the models that lose the most tokens are the ones combining FOC-POS:nav + FOC-WIN:3 + BOUNDARIES:yes and LEMMAREL:group1, losing up to 10.36% of the tokens.
Figure 21. Total remaining and context words of ‘haken’.
Figure 22. Remaining context words per token of ‘haken’.
Notes on the clouds
The concordance of haken is characterized by a number of senses, some of which are very clearly defined by lexical collocates (strafschop[gebied] in the case of haken_3, pootje for a group of its tokens), general context (the semantic field of ‘breien’+‘naaien’+‘hobby’ for haken_5) or constructions (‘in elkaar’ for part of haken_2, ‘naar’ for haken_6, ‘worden’ for haken_3, ‘blijven aan’ for haken_45), while others are less defined.
Visual comparison
In order to compare models along a defined set of parameters, the following groups of clouds were looked at side by side:
- Three groups of 8 models with BOUNDARIES:yes + FOC-POS:lex|NA + LENGTH:FOC + SOC-POS:nav, each with a different PPMI, in order to compare across different FOC-WIN and BASE but also assess the effect of PPMI.
- Three groups of 9 models with BOUNDARIES:yes + FOC-WIN:3|10|NA + LENGTH:FOC + SOC-POS:nav, each with a different PPMI, in order to compare across different FOC-POS and BASE but also assess the effect of PPMI.
Distances in the first three groups span between 0.1 and 0.8, with the highest values for LEMMAREL pairs under PPMI:weight (around 0.5 otherwise), the lowest for BOW-based and LEMMAPATH pairs, and the upper half for models with different BASE, with LEMMAPATH closer to BOW-based models than LEMMAREL. When FOC-POS is considered, in the second set, the lowest distances are between FOC-POS:all and LEMMAPATH models, followed by BOW-based models and FOC-POS:lex with LEMMAPATH, and finally pairs with LEMMAREL.
LEMMAREL:group1 consistently offers the worst models, so they won’t be discussed. Suffice it to say that whatever clusters are found in the other models are much harder to identify in LEMMAREL:group1.
In all nMDS solutions in both sets, with the notable exception of LEMMAREL:group1 (and less defined in FOC-WIN:3 and LEMMAREL:group2), there is a clear, small, dense cloud with most of the haken_3 (“make someone trip”) tokens, clustered by worden and strafschopgebied in all of them, strafschop in FOC-WIN:10, and in, en and maar in LEMMAPATH. There is little overlap between the senses, except for haken_2 (intransitive “hook”), which does overlap with haken_1 (transitive “hook”), haken_4 (figurative, intransitive “hook”) and haken_6 (with naar, “want, seek”), which in turn don’t overlap much with each other.
Among the t-SNE solutions, perplexity 20 and 30 are very similar and exhibit tighter clusters than perplexity 10, while perplexity 50 is no longer useful. The haken_3 cluster is very clear, especially in dependency-based models, and in FOC-WIN:3 + PPMI:selection|weight it is split between strafschop and strafschopgebied. haken_5 (“crochet”) and haken_6 (with naar) are particularly well separated in dependency-based models. With perplexity 20, another cluster with mostly haken_2 tokens emerges, characterized by the co-occurrence of in and elkaar.
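Perplexity comparisons like the one above can be reproduced with scikit-learn's TSNE on a precomputed distance matrix. A minimal sketch with hypothetical token vectors (the actual models use lemma/part-of-speech-based vectors):

```python
import numpy as np
from scipy.spatial.distance import pdist, squareform
from sklearn.manifold import TSNE

rng = np.random.default_rng(4)
vectors = rng.normal(size=(120, 20))           # hypothetical token vectors
dist = squareform(pdist(vectors, metric="cosine"))

solutions = {}
for perp in (10, 20, 30, 50):                  # perplexity must stay below n_tokens
    # precomputed distances require a random (not PCA) initialization
    tsne = TSNE(n_components=2, metric="precomputed", init="random",
                perplexity=perp, random_state=4)
    solutions[perp] = tsne.fit_transform(dist)
```

Lower perplexity emphasizes local neighbourhoods (tight, small clusters); higher perplexity favours global arrangement, which is why cluster tightness varies across the solutions discussed here.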
All these clusters are well defined in LEMMAPATH models; haken_3 is a bit weaker in LEMMAREL:group2 and FOC-WIN:3, haken_5 prefers FOC-WIN:10 + FOC-POS:lex over other BOW-based models, and haken_6 and ‘in elkaar’ prefer FOC-POS:all. None of them seems very sensitive to PPMI.
Clustering and separability measures
One other way of selecting the “best” models, or of quantifying their quality, is to use separability indices (from the semvar package). The four main indices (“DR”, “SIL”, “SCP” with k=10, and “kNN” with k=10) have been computed and their values (global quality, quality by class and mean class quality) have been ranked. The top 10% for harmonic mean of ranks includes 53 solutions; there is no overlap between the top 10% of the different measures; in fact, there is not even overlap between the top 10% of all values of “kNN” or “SIL”.
Of these models, the ones with the highest-ranking v-measures across the different methods are LEMMAPATHweight.PPMIselection.LENGTHFOC.SOCPOSall (tsne.10) and its tsne.30 counterpart, illustrated in Figure 23.
Figure 23. Best models of ‘haken’ according to separability indices and v-measures.
The actual separability indices for these two best models are shown in Table 5; the first four value columns correspond to the first model and the last four to the second.
| level | DR | kNN | SCP | SIL | DR | kNN | SCP | SIL |
|---|---|---|---|---|---|---|---|---|
| globqual | 2.537 | 0.728 | 0.694 | 0.155 | 2.546 | 0.715 | 0.682 | 0.177 |
| meanclassqual | 2.791 | 0.683 | 0.631 | 0.240 | 2.679 | 0.678 | 0.626 | 0.242 |
| classqual | ||||||||
| haken_1 | 1.760 | 0.458 | 0.385 | 0.045 | 1.750 | 0.429 | 0.390 | -0.081 |
| haken_2 | 1.512 | 0.719 | 0.700 | -0.254 | 1.586 | 0.703 | 0.701 | -0.177 |
| haken_3 | 3.717 | 0.957 | 0.949 | 0.593 | 3.871 | 0.951 | 0.911 | 0.625 |
| haken_4 | 2.910 | 0.571 | 0.505 | 0.382 | 3.287 | 0.501 | 0.431 | 0.435 |
| haken_5 | 1.966 | 0.510 | 0.338 | 0.064 | 2.182 | 0.587 | 0.430 | 0.129 |
| haken_6 | 4.879 | 0.886 | 0.911 | 0.609 | 3.398 | 0.895 | 0.893 | 0.519 |
More on separability indices
While the “best” models were chosen based on a ranking, the range of the values is very different across lemmas. The curve of the different values by rank for haken is shown in Figure 24. DR values were ignored for one solution because they were too high (larger than 10, up to 10.46) at some levels (classqual_haken_6).
The highest local values are very high for haken_3 and haken_6, but the latter decreases drastically, while the rest decrease a bit more slowly, with values mostly above 0.5 for haken_2 and haken_3 and lower for the rest. The silhouette decreases just as gently (though more drastically for haken_5 and haken_6), with maximum values barely over 0.5 and mostly negative values.
Figure 24. Separability indices of ‘haken’ by rank for different measures and levels.
Figure 25. Separability indices of ‘haken’ by harmonic mean of ranks for different measures and levels.
HERINNEREN
The sample of herinneren tokens consists of 240 tokens, with the following sense frequency: herinneren_1: 76, herinneren_2: 160, herinneren_3: 4. The third sense is so infrequent that it will not be mentioned in the description of the clouds.
Based on the visual analysis of the cloud of models, herinneren belongs to FW (sequential window structure, in this case with the dependency-based group in an orthogonal position), FP (FOC-POS:lex is well separated from the rest, and FOC-POS:all is between it and the dependency-based models), and P (there is some grouping based on PPMI, although not so strong). There is no organization based on SOC-VECTOR (although LENGTH:5000 + SOC-POS:all does tend towards the edges) or special clustering of LEMMAREL:group1. The rankings based on v-measures, both from the nMDS coordinates and from the original procrustes matrix, confirm the FW grouping, while the former confirms the FP grouping (if an FP criterion is used instead of basic FOC-POS) and the latter suggests BASE+BOUNDARIES.
Strength of parameters
A first impression of the clouds comes from the stress values of the dimensionality reduction and from the parameters that make the strongest distinctions between models. We have 204 models of herinneren, created on 13/07/2020, modeling between 169 and 240 tokens. The stress value of the MDS solution for the cloud of models is 0.118. The stress values of the MDS solutions of the individual models are not available (the computed range, Inf to -Inf, points to missing values).
As can be seen in Figure 26, the main split is made by FOC-POS:lex against the rest, followed by a weaker one by FOC-WIN.
Figure 26. Cloud of models of ‘herinneren’. Explore it here.
First order filters
Figure 27 and Figure 28 show the quantitative effect of the first order filters on the total number of FOCs and per token, respectively. The panels on the left show the data for the BOW-based models, while those on the right show the data for the dependency-based models.
Many tokens are lost by the combination of FOC-POS:lex with PPMI:yes or FOC-WIN:3, up to 29.58% in the worst case. The dependency-based models, instead, don’t lose that many tokens, even though they keep fewer first-order context words.
Figure 27. Total remaining and context words of ‘herinneren’.
Figure 28. Remaining context words per token of ‘herinneren’.
Notes on the clouds
The concordance of herinneren is characterized by two main senses and one so infrequent that it can be neglected (it also barely clusters together). One of them is reflexive and occurs mostly in the first (ik herinner me) and third (hij/ze/[naam] herinnert zich) person, which form distinct groups in a number of models. The other one co-occurs with aan, so that a small cluster is formed by the specific co-occurrence with eraan.
Because the distinctive items are not nouns, adjectives, verbs or adverbs, FOC-POS:lex models fail to separate them the way other models do.
Visual comparison
In order to compare models along a defined set of parameters, the following groups of clouds were looked at side by side:
- Three groups of 8 models with BOUNDARIES:yes + FOC-POS:lex|NA + LENGTH:FOC + SOC-POS:nav, each with a different PPMI, in order to compare across different FOC-WIN and BASE but also assess the effect of PPMI.
- Three groups of 9 models with BOUNDARIES:yes + FOC-WIN:3|10|NA + LENGTH:FOC + SOC-POS:nav, each with a different PPMI, in order to compare across different FOC-POS and BASE but also assess the effect of PPMI.
The lowest distances, normally below 0.25, are between dependency-based models; distances are a bit higher between dependency-based and FOC-POS:all models, then FOC-POS:lex pairs, followed by BOW-based models with different FOC-POS, and finally pairs of one dependency-based and one FOC-POS:lex model, reaching 0.8 or even higher.
In nMDS solutions, there’s a dense herinneren_2 core with a fuzzy halo and overlap between senses in the FOC-POS:lex models and better separation between them in FOC-POS:all and dependency-based models. In PPMI:weight, three clusters emerge: one for third person herinneren_2 (collocation with ‘zich’), one for first person herinneren_2 (collocation with ‘me’ and ‘ik’) and one for ‘eraan’ in herinneren_1. These clusters are most clear in FOC-WIN:10 + FOC-POS:all and LEMMAREL, and still clear but less separated in LEMMAPATH and FOC-WIN:3 + FOC-POS:all.
In t-SNE solutions, perplexity of 20 strengthens the clusters in the dependency-based models but does not improve the ones in the BOW-based models; higher perplexity does not help either. In FOC-POS:lex + PPMI:weight there can be some grouping but it’s based on the occurrence of ‘nog’, ‘niet’, ‘kunnen’ and ‘niets’. In all dependency-based and FOC-POS:all + PPMI:weight models, the two main senses are clearly separated: two groups for first and third person herinneren_2 (more separated if PPMI:weight, closer together otherwise), a small ‘eraan’ cluster and one for the rest of herinneren_1.
Clustering and separability measures
One other way of selecting the “best” models, or of quantifying their quality, is to use separability indices (from the semvar package). The four main indices (“DR”, “SIL”, “SCP” with k=10, and “kNN” with k=10) have been computed and their values (global quality, quality by class and mean class quality) have been ranked. The top 10% for all ranks includes 33 solutions.
Of these models, the ones with the highest-ranking v-measures across the different methods are LEMMAREL2.PPMIselection.LENGTH5000.SOCPOSnav (tsne.50) and LEMMAREL1.PPMIselection.LENGTH5000.SOCPOSnav (tsne.50), illustrated in Figure 29.
Figure 29. Best models of ‘herinneren’ according to separability indices and v-measures.
The actual separability indices for these two best models are shown in Table 6; the first four value columns correspond to the first model and the last four to the second.
| level | DR | kNN | SCP | SIL | DR | kNN | SCP | SIL |
|---|---|---|---|---|---|---|---|---|
| globqual | 2.628 | 0.985 | 0.984 | 0.552 | 3.408 | 0.982 | 0.984 | 0.602 |
| meanclassqual | 2.939 | 0.982 | 0.982 | 0.584 | 4.005 | 0.976 | 0.978 | 0.640 |
| classqual | ||||||||
| herinneren_1 | 3.797 | 0.974 | 0.974 | 0.675 | 5.640 | 0.958 | 0.960 | 0.743 |
| herinneren_2 | 2.081 | 0.990 | 0.989 | 0.494 | 2.371 | 0.993 | 0.996 | 0.536 |
More on separability indices
While the “best” models were chosen based on a ranking, the range of the values is very different across lemmas. The curve of the different values by rank for herinneren is shown in Figure 30.
The highest local values are very high, decreasing slowly and staying quite high for herinneren_2; the silhouette, instead, starts at barely 0.5 and decreases faster, although only the worst values go below 0.
Figure 30. Separability indices of ‘herinneren’ by rank for different measures and levels.
Figure 31. Separability indices of ‘herinneren’ by harmonic mean of ranks for different measures and levels.
HATEN
The sample of haten tokens consists of 229 tokens, with the following sense frequency: both: 16, haten_1: 102, haten_2: 111.
Based on visual analysis of the cloud of models, haten belongs to FP (FOC-POS:lex separated from the rest, in this case with FOC-POS:all not separated from the dependency-based models), and LSP (LENGTH:5000 + SOC-POS:all models are separated from the rest, although in this case this is much more obvious within FOC-POS:lex). There is no clear organization based on PPMI or FOC-WIN, nor are the LEMMAREL:group1 models clustered apart. The rankings based on v-measures, both from the nMDS coordinates and from the original procrustes matrix, confirm only the FP grouping (through the FOC-POS division for the nMDS-based measures and through FP for the procrustes-based), and the former suggests BASE[+BOUNDARIES].
Strength of parameters
A first impression of the clouds comes from the stress values of the dimensionality reduction and from the parameters that make the strongest distinctions between models. We have 204 models of haten, created on 13/07/2020, modeling between 172 and 229 tokens. The stress value of the MDS solution for the cloud of models is 0.155. The stress values of the MDS solutions of the individual models are not available (the computed range, Inf to -Inf, points to missing values).
As can be seen in Figure 32, the main division, across the horizontal axis, is made by FOC-POS, while an orthogonal split is made by SOC-VECTOR, affecting FOC-POS:lex more than the other half.
Figure 32. Cloud of models of ‘haten’. Explore it here.
First order filters
Figure 33 and Figure 34 show the quantitative effect of the first order filters on the total number of FOCs and per token, respectively. The panels on the left show the data for the BOW-based models, while those on the right show the data for the dependency-based models.
A large number of tokens are lost by the combination of FOC-POS:lex + BOUNDARIES:yes (at least 5.68%), and even more if FOC-WIN:3 and/or PPMI:yes (up to 24.89%). Dependency-based models, however, don’t lose that many tokens, even though they also apply sentence boundaries and they reduce the total number of FOCs much more: LEMMAREL:group1 + PPMI:yes models lose 6.99%.
Figure 33. Total remaining and context words of ‘haten’.
Figure 34. Remaining context words per token of ‘haten’.
Notes on the clouds
The concordance of haten is characterized by two transitive senses (and a smaller one “in between”) distinguished mostly by the kind of object (mostly its animacy). This is apparently not enough to distinguish them in the clouds.
Visual comparison
In order to compare models along a defined set of parameters, the following groups of clouds were looked at side by side:
- Three groups of 9 models with BOUNDARIES:yes + FOC-WIN:3|10|NA + LENGTH:FOC + SOC-POS:nav, each with a different PPMI, in order to compare across different FOC-POS and BASE but also assess the effect of PPMI.
- Three groups of 8 models with BOUNDARIES:yes + FOC-WIN:10, each with a different PPMI, in order to compare across different SOC-POS, LENGTH and FOC-POS but also assess the effect of PPMI.
Distances span between 0.1 and 0.8, with very high values for pairs where one model is FOC-POS:lex and the other is not, very low values for pairs of dependency-based models, and intermediate values for the rest, in the lower half of that range when PPMI:weight and higher when PPMI:no. Distances between LENGTH:FOC models and between SOC-POS:nav models are lower, but the effect is negligible if the models differ in FOC-POS.
Both in nMDS and t-SNE solutions the two main senses overlap completely. The only clusters to be found (in lower perplexities, 10 and 20) are one of ‘ik’+‘het’ (mostly haten_2) and one for ‘worden’. The former is mostly visible in LEMMAREL but also in (FOC-POS:all | LEMMAPATH) + PPMI:weight with perplexity 10, and the latter only appears in FOC-WIN:3 + FOC-POS:lex + PPMI:weight. In perplexity 20, LEMMAPATH models also seem to split the tokens according to the presence or absence of ‘en’ and ‘die’.
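The perplexity sweep used for the t-SNE solutions can be sketched as follows, using scikit-learn on random stand-in vectors (the actual solutions were computed from the models' distance matrices; this only shows the schematic procedure):

```python
# Illustrative: one 2D t-SNE solution per perplexity value, skipping values
# that are too large for the number of tokens. Random stand-in vectors.
import numpy as np
from sklearn.manifold import TSNE

rng = np.random.default_rng(1)
vecs = rng.random((200, 50))       # 200 hypothetical token vectors

solutions = {}
for perplexity in (10, 20, 30, 50):
    if perplexity >= len(vecs):    # not enough tokens for this perplexity
        continue
    tsne = TSNE(n_components=2, perplexity=perplexity,
                init="random", random_state=0)
    solutions[perplexity] = tsne.fit_transform(vecs)

print(sorted(solutions))           # [10, 20, 30, 50]
```

The guard on `perplexity >= len(vecs)` mirrors the remark in the introduction that some models did not have enough tokens for perplexity 50.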
Clustering and separability measures
One other way of selecting the “best” models, or of quantifying their quality, is to use separability indices (from the semvar package). The four main indices (“DR”, “SIL”, “SCP” with k=10, and “kNN” with k=10) have been computed and their values (global quality, quality by class and mean class quality) have been ranked. The top 10% for harmonic mean of ranks includes only 11 solutions; there is no overlap between the top 10% of the different measures. In fact, the only overlap is among the top 10% of all values of “kNN”.
- LEMMAPATHselection2.PPMIweight.LENGTH5000.SOCPOSnav (tsne.30)
- BOWbound5all.PPMIweight.LENGTHFOC.SOCPOSall (tsne.50)
- LEMMAREL1.PPMIweight.LENGTH5000.SOCPOSnav (tsne.20)
- BOWnobound5all.PPMIweight.LENGTHFOC.SOCPOSnav (tsne.30)
- BOWbound5all.PPMIweight.LENGTHFOC.SOCPOSall (tsne.30)
- LEMMAREL2.PPMIno.LENGTHFOC.SOCPOSnav (tsne.50)
- LEMMAREL2.PPMIno.LENGTHFOC.SOCPOSnav (tsne.30)
- LEMMAREL2.PPMIno.LENGTHFOC.SOCPOSall (tsne.30)
- LEMMAREL2.PPMIselection.LENGTHFOC.SOCPOSnav (tsne.50)
- LEMMAPATHweight.PPMIselection.LENGTHFOC.SOCPOSnav (tsne.20)
- BOWnobound3all.PPMIno.LENGTHFOC.SOCPOSnav (tsne.20)
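The selection by harmonic mean of ranks can be sketched as follows; the scores and model names below are toy stand-ins, and rank 1 is taken as best:

```python
# Illustrative: rank toy models on four measures, combine the ranks with a
# harmonic mean, and keep the top 10%.
import numpy as np
from scipy.stats import hmean, rankdata

rng = np.random.default_rng(2)
models = [f"model_{i}" for i in range(40)]          # hypothetical names
measures = ["DR", "SIL", "SCP", "kNN"]
scores = {m: dict(zip(measures, rng.random(4))) for m in models}

# rank 1 = highest score on each measure
ranks = {meas: rankdata([-scores[m][meas] for m in models], method="min")
         for meas in measures}
hm = {m: hmean([ranks[meas][i] for meas in measures])
      for i, m in enumerate(models)}

cutoff = max(1, round(0.10 * len(models)))
top = sorted(models, key=hm.get)[:cutoff]
print(len(top))   # 4
```

A harmonic mean of ranks rewards models that do well on every measure, since one very poor rank pulls the mean down sharply.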
Of these models, the ones with the highest-ranking v-measures across the different methods are LEMMAREL2.PPMIno.LENGTHFOC.SOCPOSnav (tsne.50) and LEMMAREL1.PPMIweight.LENGTH5000.SOCPOSnav (tsne.20), illustrated in Figure 35.
Figure 35. Best models of ‘haten’ according to separability indices and v-measures.
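The v-measures used alongside these indices compare a clustering of the solution against the sense tags; below is a minimal sketch with toy data and scikit-learn (the clustering algorithm is illustrative, not necessarily the one used in the analysis):

```python
# Illustrative: v-measure between sense tags and clusters induced from 2D
# coordinates. Toy, well-separated data stand in for a good solution.
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import v_measure_score

rng = np.random.default_rng(3)
coords = np.vstack([rng.normal(0, 1, (50, 2)),     # "sense 1" tokens
                    rng.normal(5, 1, (50, 2))])    # "sense 2" tokens
senses = [0] * 50 + [1] * 50

clusters = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(coords)
score = v_measure_score(senses, clusters)
print(score > 0.8)   # well-separated senses yield a high v-measure
```

The v-measure is the harmonic mean of homogeneity and completeness, so it penalizes both mixed clusters and senses split across clusters.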
The actual separability indices for these two best models are shown in Table 7.
| level | DR | kNN | SCP | SIL | DR | kNN | SCP | SIL |
|---|---|---|---|---|---|---|---|---|
| globqual | 1.178 | 0.583 | 0.573 | 0.039 | 1.172 | 0.603 | 0.588 | 0.055 |
| meanclassqual | 1.125 | 0.439 | 0.426 | -0.023 | 1.100 | 0.472 | 0.429 | -0.022 |
| classqual | ||||||||
| both | 0.986 | 0.072 | 0.050 | -0.181 | 0.915 | 0.140 | 0.035 | -0.219 |
| haten_1 | 1.244 | 0.614 | 0.634 | 0.059 | 1.170 | 0.595 | 0.538 | 0.057 |
| haten_2 | 1.145 | 0.630 | 0.595 | 0.054 | 1.214 | 0.682 | 0.716 | 0.096 |
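As an illustration of what a kNN-based index with k=10 measures, the sketch below computes, on toy 2D coordinates, the mean share of each token's 10 nearest neighbours that carry the same sense tag, globally and per class. This mirrors the spirit of the “kNN” column, not the exact semvar formula:

```python
# Illustrative kNN class-quality: proportion of same-sense tokens among the
# 10 nearest neighbours, on toy 2D coordinates with two separated senses.
import numpy as np

rng = np.random.default_rng(4)
coords = np.vstack([rng.normal(0, 1, (60, 2)),
                    rng.normal(4, 1, (60, 2))])
labels = np.array([0] * 60 + [1] * 60)

k = 10
dists = np.sqrt(((coords[:, None] - coords[None, :]) ** 2).sum(-1))
np.fill_diagonal(dists, np.inf)           # a token is not its own neighbour
nn = np.argsort(dists, axis=1)[:, :k]     # indices of k nearest neighbours
same = (labels[nn] == labels[:, None]).mean(axis=1)

global_quality = same.mean()                              # cf. "globqual"
per_class = {c: same[labels == c].mean() for c in (0, 1)}  # cf. "classqual"
print(global_quality > 0.8)   # separated toy senses score high
```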
More on separability indices
While the “best” models were chosen based on a ranking, the range of the values is very different across lemmas. The curve of the different values by rank for haten is shown in Figure 36.
The local values start relatively low, barely above 0.6 for haten_1 and haten_2 and around 0.2 or lower for “both”, but they don’t decrease much. The third, least frequent reading instead starts with the highest global values (still quite low) and decreases very fast.
Figure 36. Separability indices of ‘haten’ by rank for different measures and levels.
Figure 37. Separability indices of ‘haten’ by harmonic mean of ranks for different measures and levels.
HELPEN
The sample of helpen tokens consists of 233 tokens, with the following sense frequency: helpen_1: 77, helpen_2: 59, helpen_3: 40, helpen_4: 23, helpen_5: 21, om zeep helpen: 7, remove: 6.
Based on visual analysis of the cloud of models, helpen belongs to FP (FOC-POS:lex separated from the rest, with FOC-POS:all slightly closer to it than the dependency-based models), LSP (LENGTH:5000 + SOC-POS:all are separated from the rest), and LR1 (LEMMAREL:group1 is separated from the rest of the models, in this case together with the LENGTH:5000 + SOC-POS:all group). There is no organization based on PPMI or FOC-WIN. The rankings based on v-measures, both from the nMDS coordinates and from the original procrustes matrix, confirm a SOC-VECTOR grouping.
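The “original procrustes matrix” mentioned in these analyses contains distances between model solutions after procrustes alignment. Below is a minimal sketch of such a comparison, using scipy on toy coordinates (the actual pipeline and its exact distance definition are not reproduced here):

```python
# Illustrative: procrustes disparity between two 2D token configurations.
# Model B is a rotated, slightly noisy copy of model A, so after optimal
# translation, scaling and rotation the residual disparity is small.
import numpy as np
from scipy.spatial import procrustes

rng = np.random.default_rng(5)
sol_a = rng.random((100, 2))                          # tokens in model A
rotation = np.array([[0.0, -1.0], [1.0, 0.0]])        # 90-degree rotation
sol_b = sol_a @ rotation + 0.05 * rng.random((100, 2))

_, _, disparity = procrustes(sol_a, sol_b)
print(disparity < 0.1)   # near-identical configurations give ~0
```

Because procrustes alignment factors out translation, scaling and rotation, such a distance reflects only genuine differences in the relative position of tokens.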
Strength of parameters
A first impression of the clouds comes from the stress values of the dimensionality reduction and from the parameters that make the strongest distinctions between models. We have 204 models of helpen, created on 13/07/2020, modeling between 196 and 233 tokens. The stress value of the MDS solution for the cloud of models is 0.184.
As can be seen in Figure 38, the models are split orthogonally by FOC-POS (across the horizontal axis) and SOC-VECTOR (across the vertical one).
Figure 38. Cloud of models of ‘helpen’.
First order filters
Figure 39 and Figure 40 show the quantitative effect of the first order filters on the total number of FOCs and on the number per token, respectively. The panels on the left show the data from the BOW-based models, while those on the right show the dependency-based models.
A number of tokens are lost by the combination PPMI:yes + FOC-POS:lex, especially when BOUNDARIES:yes, and by LEMMAREL:group1 (up to 15.88%). As usual, LEMMAPATH models don’t lose that many tokens and keep a relatively high number of context words per token, even though the total number of FOCs is low.
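The token losses reported in these subsections arise when a token retains no context words after filtering. A toy sketch of that bookkeeping (the filter and tags are illustrative, not the actual filter sets):

```python
# Illustrative: a token is "lost" when no context word survives the
# first-order filters; here a FOC-POS:lex-style filter on toy contexts.
tokens = [
    ["goed/adj", "helpen/verb", "de/det"],
    ["de/det", "het/pron"],        # only function words: token is lost
    ["mens/noun", "graag/adv"],
    ["in/prep"],                   # lost under the lexical filter
]

LEX_POS = {"noun", "adj", "verb", "adv"}   # hypothetical lexical tag set

def surviving(context):
    return [w for w in context if w.split("/")[1] in LEX_POS]

lost = sum(1 for ctx in tokens if not surviving(ctx))
print(round(100 * lost / len(tokens), 2))  # 50.0
```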
Figure 39. Total remaining tokens and context words of ‘helpen’.
Figure 40. Remaining context words per token of ‘helpen’.
Notes on the clouds
The concordance of helpen is characterized by a number of similarly frequent senses, some of which occur in fixed constructions caught by dependency-based or FOC-POS:all models. helpen_5 is defined by its co-occurrence with aan, and the idiomatic expression om zeep helpen is also defined by its context (but its components are tagged “fixed” for part of speech, so FOC-POS:lex excludes them). helpen_3, on the other hand, is defined by the animacy of its subject and the lack of an object, but its frequent occurrence in constructions such as “het helpt niet” makes it easier for the models to cluster it.
Visual comparison
In order to compare models along a defined set of parameters, the following groups of clouds were looked at side by side:
- Three groups of 9 models with BOUNDARIES:yes + FOC-WIN:3|10|NA + LENGTH:FOC + SOC-POS:nav, each with a different PPMI, in order to compare across different FOC-POS and BASE but also assess the effect of PPMI.
- Three groups of 9 models with BOUNDARIES:yes + FOC-WIN:3|10|NA + LENGTH:5000 + SOC-POS:all, each with a different PPMI, in order to compare across different FOC-POS and BASE but also assess the effect of PPMI and SOC-VECTOR.
Distances are normally around 0.7 or 0.8 for pairs with one FOC-POS:lex model, lower than 0.2 for LEMMAPATH pairs and in between for the rest; pairs with one LEMMAPATH model are more similar than those with a LEMMAREL model, and similarity between dependency-based models increases with PPMI:no and decreases with PPMI:weight.
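The pairwise distances discussed throughout are cosine distances, cosine being the similarity function fixed for all models (as noted in the introduction). A minimal sketch on toy vectors:

```python
# Illustrative: cosine distance matrix between a few toy token vectors.
import numpy as np

rng = np.random.default_rng(6)
vecs = rng.random((6, 20))     # 6 hypothetical token vectors

unit = vecs / np.linalg.norm(vecs, axis=1, keepdims=True)
cos_sim = unit @ unit.T
cos_dist = 1 - cos_sim         # 0 for identical directions

print(bool(np.allclose(np.diag(cos_dist), 0)))   # True
```

For the non-negative count-based vectors used here, cosine distances fall between 0 and 1, matching the ranges quoted in these comparisons.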
In nMDS solutions, helpen_3 (inanimate, intransitive), helpen_5 (with aan) and om zeep helpen are relatively well grouped and don’t overlap with each other, unless PPMI:no; the rest of the senses do overlap. However, the models in which these senses are grouped are not always the same. In the first set of models, helpen_3 is quite clear in all but LEMMAREL:group1, and the other two in all but FOC-POS:lex; in the second set, instead, the former is more or less clear in FOC-WIN:3, helpen_5 in (LEMMAREL:group1 | FOC-WIN:10 + FOC-POS:all) + PPMI:weight and LEMMAPATH + PPMI:selection, and om zeep helpen in LEMMAPATH | LEMMAREL:group2 | FOC-POS:all.
In t-SNE solutions, perplexity 20 seems to enhance the weak clusters found at lower perplexity, but higher values don’t improve the models: only helpen_5 is visible, and barely, in LEMMAREL:group1 + PPMI:selection|weight models of the first set; in the second it’s much less clear. The three aforementioned clusters are still present, especially in PPMI:weight and with perplexity 10, and never in FOC-POS:lex models (where a cluster for ‘niet’+‘kunnen’ can be found instead, if FOC-WIN:3). helpen_3 is grouped by the co-occurrence of ‘niet’+‘het’ in all but LEMMAREL; helpen_5 is best clustered in LEMMAREL and split in two in the rest; om zeep helpen in all but LEMMAREL:group1. PPMI:weight also pulls other sets of tokens together, regardless of their sense tags: occurrences with ‘bij’, ‘vooruit’ and ‘bovenop’, especially in dependency-based models.
Clustering and separability measures
One other way of selecting the “best” models, or of quantifying their quality, is to use separability indices (from the semvar package). The four main indices (“DR”, “SIL”, “SCP” with k=10, and “kNN” with k=10) have been computed and their values (global quality, quality by class and mean class quality) have been ranked. The top 10% for harmonic mean of ranks includes only 39 solutions; there is no overlap between the top 10% of the different measures. In fact, there is no overlap even among the top 10% of all values of any single measure.
Of these models, the ones with the highest-ranking v-measures across the different methods are LEMMAREL2.PPMIselection.LENGTH5000.SOCPOSnav (tsne.20) and LEMMAREL2.PPMIno.LENGTH5000.SOCPOSnav (tsne.20), illustrated in Figure 41.
Figure 41. Best models of ‘helpen’ according to separability indices and v-measures.
The actual separability indices for these two best models are shown in Table 8.
| level | DR | kNN | SCP | SIL | DR | kNN | SCP | SIL |
|---|---|---|---|---|---|---|---|---|
| globqual | 1.414 | 0.435 | 0.368 | -0.022 | 1.440 | 0.435 | 0.398 | -0.011 |
| meanclassqual | 1.715 | 0.441 | 0.387 | 0.052 | 1.764 | 0.449 | 0.418 | 0.051 |
| classqual | ||||||||
| helpen_1 | 1.206 | 0.453 | 0.419 | -0.050 | 1.259 | 0.460 | 0.449 | 0.020 |
| helpen_2 | 1.000 | 0.363 | 0.233 | -0.215 | 0.984 | 0.376 | 0.272 | -0.223 |
| helpen_3 | 1.361 | 0.544 | 0.415 | 0.089 | 1.373 | 0.477 | 0.457 | 0.054 |
| helpen_4 | 0.993 | 0.082 | 0.073 | -0.240 | 0.993 | 0.081 | 0.071 | -0.255 |
| helpen_5 | 4.014 | 0.763 | 0.794 | 0.677 | 4.213 | 0.850 | 0.838 | 0.657 |
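The “SIL” columns in these tables are silhouette-style values; per-class averages like meanclassqual can be sketched with scikit-learn's per-sample silhouettes on toy data (again only the general idea, not the semvar computation):

```python
# Illustrative: global and per-class mean silhouette widths on toy 2D data.
import numpy as np
from sklearn.metrics import silhouette_samples

rng = np.random.default_rng(7)
coords = np.vstack([rng.normal(0, 1, (40, 2)),
                    rng.normal(3, 1, (40, 2))])
labels = np.array([0] * 40 + [1] * 40)

sil = silhouette_samples(coords, labels)    # one value in [-1, 1] per token
global_sil = sil.mean()                     # cf. the "globqual" row
per_class = {c: sil[labels == c].mean() for c in (0, 1)}   # cf. "classqual"
print(-1 <= global_sil <= 1)   # True
```

Negative per-class values, as seen for several senses in these tables, mean that tokens of that sense are on average closer to another sense's tokens than to their own.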
More on separability indices
While the “best” models were chosen based on a ranking, the range of the values is very different across lemmas. The curve of the different values by rank for helpen is shown in Figure 42.
Figure 42. Separability indices of ‘helpen’ by rank for different measures and levels.
Figure 43. Separability indices of ‘helpen’ by harmonic mean of ranks for different measures and levels.
HERROEPEN
The sample of herroepen tokens consists of 236 tokens, with the following sense frequency: herroepen_1: 141, herroepen_2: 95.
Based on visual analysis of the cloud of models, herroepen belongs to LSP (the LENGTH:5000 + SOC-POS:all models are separated from the rest). There is no clear organization related to FOC-WIN, FOC-POS or PPMI, nor are the LEMMAREL:group1 models particularly separated. The rankings based on v-measures, both from the nMDS coordinates and from the original procrustes matrix, confirm a SOC-VECTOR grouping and also suggest FOC-WIN and BASE[+BOUNDARIES].
Strength of parameters
A first impression of the clouds comes from the stress values of the dimensionality reduction and from the parameters that make the strongest distinctions between models. We have 204 models of herroepen, created on 13/07/2020, modeling between 205 and 236 tokens. The stress value of the MDS solution for the cloud of models is 0.242.
As can be seen in Figure 44, the cloud is not as neatly split as in other lemmas; the only visible division is given by SOC-VECTOR.
Figure 44. Cloud of models of ‘herroepen’.
First order filters
Figure 45 and Figure 46 show the quantitative effect of the first order filters on the total number of FOCs and on the number per token, respectively. The panels on the left show the data from the BOW-based models, while those on the right show the dependency-based models.
Very few tokens are lost by BOW-based models, only up to 4.24% when all filters are applied, but more are lost by LEMMAREL:group1: 13.14% when PPMI:yes.
Figure 45. Total remaining tokens and context words of ‘herroepen’.
Figure 46. Remaining context words per token of ‘herroepen’.
Notes on the clouds
The concordance of herroepen is characterized by two relatively frequent senses with different kinds of objects. The contexts seem different enough to be pulled apart in all LENGTH:FOC + SOC-POS:nav models, and some frequent lexical collocates even cluster in t-SNE solutions with low perplexity.
Visual comparison
In order to compare models along a defined set of parameters, the following groups of clouds were looked at side by side:
- Three groups of 8 models with BOUNDARIES:yes + FOC-POS:lex|NA + LENGTH:FOC + SOC-POS:nav, each with a different PPMI, in order to compare across different FOC-WIN and BASE but also assess the effect of PPMI.
- Three groups of 8 models with FOC-WIN:5 + (LENGTH:5000 + SOC-POS:all) | (LENGTH:FOC + SOC-POS:nav), each with a different PPMI, in order to compare across different BOUNDARIES, FOC-POS and SOC-VECTOR but also assess the effect of PPMI.
- Three groups of 8 models with LEMMAPATH | LEMMAREL:selection + (LENGTH:5000 + SOC-POS:all) | (LENGTH:FOC + SOC-POS:nav), each with a different PPMI, in order to compare across different dependency-based templates and SOC-VECTOR but also assess the effect of PPMI.
In the first group, distances range up to 0.7 if PPMI:no and up to 0.6 otherwise; the highest values are between LEMMAREL and BOW-based models and between FOC-WIN:3 and everything but FOC-WIN:5; the lowest are for LEMMAREL and LEMMAPATH pairs. In the second group, distances between models with different SOC-VECTOR lie between 0.35 and 0.5 if PPMI:weight, and slightly higher (up to 0.8) otherwise, where LENGTH:5000 + SOC-POS:all models with different FOC-POS are also far from each other; distances are much lower for models with the same SOC-VECTOR, and lower still if they also share FOC-POS. The values are very similar if FOC-WIN is kept constant at another value. Finally, in the third group, the largest differences (up to 0.5 if PPMI:weight|selection, up to 0.7 otherwise) are between models with different SOC-VECTOR, lower when it is the same, with the lowest values for LEMMAPATH pairs.
In the n-MDS clouds, there is a relatively good split between senses in all models, with contact between them but not much overlap. Between the BOW-based models, there seems to be less overlap in the BOUNDARIES:yes + LENGTH:FOC + SOC-POS:nav models. In the dependency-based, the senses are well split in LENGTH:FOC + SOC-POS:nav but overlap much more in LENGTH:5000 + SOC-POS:all.
In the t-SNE clouds, some clusters emerge for certain frequent lexical collocates, but the rest of the tokens become more dispersed the higher the perplexity, hiding the clusters at perplexity 50. There are five main clusters: one of herroepen_1 gathered by ‘beslissing’, two of herroepen_2 gathered by ‘verklaring’ and ‘laat’ (“dit werd later herroepen”), and one dense herroepen_2 cluster gathered by ‘uitspraak’ that hangs out next to a less compact herroepen_1 cluster gathered by ‘worden’ and several legal terms (‘rechtbank’, ‘vonnis’, etc.).
In LENGTH:5000 + SOC-POS:all models, the ‘laat’ cluster is better shaped but the ‘uitspraak’ cluster loses its herroepen_1 friend, and from perplexity 20 on only the ‘beslissing’ cluster is clear, and well separated from everything else. All clusters are clearer in BOUNDARIES:yes + LENGTH:FOC + SOC-POS:nav.
Clustering and separability measures
One other way of selecting the “best” models, or of quantifying their quality, is to use separability indices (from the semvar package). The four main indices (“DR”, “SIL”, “SCP” with k=10, and “kNN” with k=10) have been computed and their values (global quality, quality by class and mean class quality) have been ranked. The 3 models whose values lie in the top 10% for all ranks are listed below.
- BOWbound10lex.PPMIweight.LENGTHFOC.SOCPOSnav (tsne.50)
- LEMMAPATHselection2.PPMIweight.LENGTHFOC.SOCPOSnav (mds)
- LEMMAPATHselection3.PPMIweight.LENGTHFOC.SOCPOSnav (mds)
Of these models, the ones with the highest-ranking v-measures across the different methods are BOWbound10lex.PPMIweight.LENGTHFOC.SOCPOSnav (tsne.50) and LEMMAPATHselection3.PPMIweight.LENGTHFOC.SOCPOSnav (mds), illustrated in Figure 47.
Figure 47. Best models of ‘herroepen’ according to separability indices and v-measures.
The actual separability indices for these two best models are shown in Table 9.
| level | DR | kNN | SCP | SIL | DR | kNN | SCP | SIL |
|---|---|---|---|---|---|---|---|---|
| globqual | 1.443 | 0.805 | 0.780 | 0.246 | 1.406 | 0.821 | 0.813 | 0.243 |
| meanclassqual | 1.449 | 0.795 | 0.771 | 0.248 | 1.415 | 0.816 | 0.805 | 0.248 |
| classqual | ||||||||
| herroepen_1 | 1.415 | 0.848 | 0.818 | 0.237 | 1.372 | 0.840 | 0.847 | 0.224 |
| herroepen_2 | 1.484 | 0.741 | 0.723 | 0.259 | 1.458 | 0.792 | 0.763 | 0.273 |
More on separability indices
While the “best” models were chosen based on a ranking, the range of the values is very different across lemmas. The curve of the different values by rank for herroepen is shown in Figure 48.
Figure 48. Separability indices of ‘herroepen’ by rank for different measures and levels.
Figure 49. Separability indices of ‘herroepen’ by harmonic mean of ranks for different measures and levels.
HERHALEN
The sample of herhalen tokens consists of 313 tokens, with the following sense frequency: herhalen_1: 90, herhalen_2: 160, herhalen_3: 36, herhalen_4: 27.
Based on visual analysis of the cloud of models, herhalen belongs to FP (FOC-POS:lex separated from the rest, with FOC-POS:all between it and the dependency-based models), LSP (LENGTH:5000 + SOC-POS:all are separated from the rest), P (PPMI:weight models tend towards the center, PPMI:no models towards the outside, and PPMI:selection models lie in between) and LR1 (LEMMAREL:group1 is separated from the rest of the models). There is no clear organization based on FOC-WIN and, while LEMMAREL:group1 models are pushed a bit farther away, they are split by LENGTH:5000 + SOC-POS:all. The ranking of v-measures from the original procrustes matrix confirms a SOC-VECTOR grouping, and the one from the nMDS solution suggests BASE+BOUNDARIES.
Strength of parameters
A first impression of the clouds comes from the stress values of the dimensionality reduction and from the parameters that make the strongest distinctions between models. We have 204 models of herhalen, created on 13/07/2020, modeling between 267 and 312 tokens. The stress value of the MDS solution for the cloud of models is 0.237.
As can be seen in Figure 50, the models are split orthogonally by FOC-POS (across the vertical axis) and SOC-VECTOR.
Figure 50. Cloud of models of ‘herhalen’.
First order filters
Figure 51 and Figure 52 show the quantitative effect of the first order filters on the total number of FOCs and on the number per token, respectively. The panels on the left show the data from the BOW-based models, while those on the right show the dependency-based models.
A large number of tokens (9.94%) are lost in LEMMAREL:group1 + PPMI:yes models, and between 4.81% and 14.42% in FOC-WIN:3|5 + FOC-POS:lex + PPMI:yes models, but not many by the rest.
Figure 51. Total remaining tokens and context words of ‘herhalen’.
Figure 52. Remaining context words per token of ‘herhalen’.
Notes on the clouds
The concordance of herhalen is characterized by four relatively frequent senses, the most frequent of which covers half the instances.
The second least frequent sense, also the only reflexive one rather than transitive, is characterized by the co-occurrence of the reflexive pronoun ‘zich’ (it only occurs in third person) and, for about half its cases, of ‘geschiedenis’, its most frequent subject.
The least frequent one is also more specific than the others in terms of the semantic field of its arguments, which seems to make it easier to cluster.
Visual comparison
In order to compare models along a defined set of parameters, the following groups of clouds were looked at side by side:
- Three groups of 9 models with BOUNDARIES:yes + FOC-WIN:3|10|NA + LENGTH:FOC + SOC-POS:nav, each with a different PPMI, in order to compare across different FOC-POS and BASE but also assess the effect of PPMI.
- Three groups of 9 models with BOUNDARIES:yes + FOC-WIN:3|10|NA + LENGTH:5000 + SOC-POS:all, each with a different PPMI, in order to compare across different FOC-POS and BASE but also assess the effect of PPMI and SOC-VECTOR.
- One group of 9 models with (BOUNDARIES:yes + FOC-WIN:10 | LEMMAREL:group2) + LENGTH:FOC + SOC-POS:nav, in order to compare across different FOC-POS, PPMI and BASE.
- One group of 9 models with LEMMAREL|LEMMAPATH + PPMI:weight|selection + LENGTH:FOC + SOC-POS:nav, in order to compare different dependency templates and PPMI.
In the first group, distances span between 0.1 and 0.6, with lowest values for dependency-based pairs of models, and with FOC-WIN making a greater difference than FOC-POS.
The nMDS solutions tend to exhibit areas dedicated to the different senses (unless FOC-WIN:3 + FOC-POS:lex), with various degrees of overlap, especially in the LENGTH:5000 + SOC-POS:all group.
In t-SNE solutions, perplexity 10 gives numerous tiny clusters where herhalen_3 (reflexive) starts to separate, but with no separability between the rest of the senses in LENGTH:5000 + SOC-POS:all; perplexities of 30 and 50 don’t offer much improvement on perplexity 20.
PPMI:weight models split herhalen_3 in two based on the occurrence of ‘geschiedenis’; the dependency-based models, especially LEMMAREL:group2, are the ones that best preserve areas for herhalen_3 and herhalen_4, although there is still some overlap between herhalen_1 and herhalen_2.
Clustering and separability measures
One other way of selecting the “best” models, or of quantifying their quality, is to use separability indices (from the semvar package). The four main indices (“DR”, “SIL”, “SCP” with k=10, and “kNN” with k=10) have been computed and their values (global quality, quality by class and mean class quality) have been ranked. The top 10% for harmonic mean of ranks includes 68 solutions; there is no overlap between the top 10% of the different measures. The only overlap is among the top 10% of all values of each individual measure except “SIL”.
Of these models, the ones with the highest-ranking v-measures across the different methods are LEMMAREL2.PPMIselection.LENGTHFOC.SOCPOSnav (tsne.30) and LEMMAREL1.PPMIselection.LENGTHFOC.SOCPOSnav (tsne.20), illustrated in Figure 53.
Figure 53. Best models of ‘herhalen’ according to separability indices and v-measures.
The actual separability indices for these two best models are shown in Table 10.
| level | DR | kNN | SCP | SIL | DR | kNN | SCP | SIL |
|---|---|---|---|---|---|---|---|---|
| globqual | 2.042 | 0.736 | 0.718 | 0.093 | 1.800 | 0.703 | 0.660 | 0.095 |
| meanclassqual | 2.822 | 0.707 | 0.688 | 0.260 | 2.353 | 0.678 | 0.647 | 0.223 |
| classqual | ||||||||
| herhalen_1 | 1.336 | 0.592 | 0.542 | -0.107 | 1.471 | 0.565 | 0.488 | 0.099 |
| herhalen_2 | 1.396 | 0.806 | 0.801 | -0.004 | 1.165 | 0.760 | 0.717 | -0.071 |
| herhalen_3 | 6.344 | 0.993 | 1.000 | 0.777 | 5.046 | 0.936 | 0.963 | 0.681 |
| herhalen_4 | 2.212 | 0.438 | 0.408 | 0.374 | 1.730 | 0.451 | 0.420 | 0.181 |
More on separability indices
While the “best” models were chosen based on a ranking, the range of the values is very different across lemmas. The curve of the different values by rank for herhalen is shown in Figure 54.
Figure 54. Separability indices of ‘herhalen’ by rank for different measures and levels.
Figure 55. Separability indices of ‘herhalen’ by harmonic mean of ranks for different measures and levels.
HULDIGEN
The sample of huldigen tokens consists of 230 tokens, with the following sense frequency: huldigen_1: 163, huldigen_2: 67.
Based on visual analysis of the cloud of models, huldigen belongs only to P (PPMI:weight models close together). The FOC-POS:lex models are grouped together but not as clearly separated from the rest, nor does FOC-POS:all lie between them and the dependency-based models as in the other FP models. There is also no clear organization based on FOC-WIN, and neither the LENGTH:5000 + SOC-POS:all nor the LEMMAREL:group1 models are separated from the rest. The ranking of v-measures from the nMDS coordinates confirms a PPMI division and suggests FOC-POS; the one from the original procrustes distances suggests FOC-WIN, SOC-VECTOR and BASE+BOUNDARIES.
Strength of parameters
A first impression of the clouds comes from the stress values of the dimensionality reduction and from the parameters that make the strongest distinctions between models. We have 204 models of huldigen, created on 13/07/2020, modeling between 202 and 230 tokens. The stress value of the MDS solution for the cloud of models is 0.238.
As can be seen in Figure 56, there is no clear clustering of models as in other lemmas; the strongest parameters are FOC-POS and PPMI.
Figure 56. Cloud of models of ‘huldigen’.
First order filters
Figure 57 and Figure 58 show the quantitative effect of the first order filters on the total number of FOCs and on the number per token, respectively. The panels on the left show the data from the BOW-based models, while those on the right show the dependency-based models.
A number of tokens are lost by LEMMAREL:group1 and FOC-WIN:3 + FOC-POS:lex, especially if PPMI:yes (up to 12.17%), and some by FOC-POS:lex + BOUNDARIES:yes, but not that many by the rest.
Figure 57. Total remaining tokens and context words of ‘huldigen’.
Figure 58. Remaining context words per token of ‘huldigen’.
Notes on the clouds
The concordance of huldigen is characterized by two senses, one 2.43 times as frequent as the other, that occur in quite different contexts.
The least frequent sense seems to have stronger lexical collocates (e.g. principe, standpunt, opvatting) that gather the corresponding tokens in small pockets, while the more frequent one is spread more widely. That said, the two senses are consistently well separated from each other.
Visual comparison
In order to compare models along a defined set of parameters, the following groups of clouds were looked at side by side:
- Three groups of 9 models with BOUNDARIES:yes + FOC-WIN:3|10|NA + LENGTH:FOC + SOC-POS:nav, each with a different PPMI, in order to compare across different FOC-POS and BASE but also assess the effect of PPMI.
- Three groups of 9 models with BOUNDARIES:yes + FOC-WIN:3|10|NA + LENGTH:5000 + SOC-POS:all, each with a different PPMI, in order to compare across different FOC-POS and BASE but also assess the effect of PPMI and SOC-VECTOR.
- One group of 9 models with (BOUNDARIES:yes + FOC-WIN:10 | LEMMAREL:group2) + LENGTH:FOC + SOC-POS:nav, in order to compare across different FOC-POS, PPMI and BASE.
- One group of 9 models with LEMMAREL|LEMMAPATH + PPMI:weight|selection + LENGTH:FOC + SOC-POS:nav, in order to compare different dependency templates and PPMI.
Distances span between 0.1 and 0.8; they are lowest between LEMMAPATH pairs and highest between pairs involving a LEMMAREL model.
In nMDS solutions, (LEMMAPATH | FOC-POS:lex) + PPMI:selection|weight + LENGTH:FOC + SOC-POS:nav show two main cores, one for each sense, with a respectable separation between them, while there is one main cloud in the rest of the models, showing some overlap in LEMMAREL and LENGTH:5000 + SOC-POS:all.
In t-SNE models, the sizes of the clusters vary across models and perplexity values, but both senses always remain well separated (unless PPMI:no + LENGTH:5000 + SOC-POS:all). huldigen_1 (“honor someone”) is always much more widespread than huldigen_2 (“hold a point of view”); the structure seems clearest with perplexity 20, but spreads much wider with higher values.
In terms of pockets, huldigen_2 presents some strong, compact clusters based on the occurrence of ‘principe’ (which can be seen in some PPMI:weight models in nMDS solutions) and, if PPMI:weight, ‘standpunt’ and ‘opvatting’. huldigen_1 mostly exhibits a strong core in FOC-POS:lex and sometimes in LEMMAPATH | LEMMAREL:group2, marked by the presence of ‘worden’. In dependency-based and FOC-WIN:10 models, if PPMI:weight + LENGTH:FOC + SOC-POS:nav, there is also a ‘kampioen’ cluster to be found for huldigen_1, but it fades away with larger perplexities.
Clustering and separability measures
One other way of selecting the “best” models, or of quantifying their quality, is to use separability indices (from the semvar package). The four main indices (“DR”, “SIL”, “SCP” with k=10, and “kNN” with k=10) have been computed and their values (global quality, quality by class and mean class quality) have been ranked. The top 10% for all ranks includes 18 solutions:
- BOWnobound5lex.PPMIweight.LENGTHFOC.SOCPOSnav (tsne.30)
- BOWnobound5lex.PPMIweight.LENGTHFOC.SOCPOSall (tsne.30)
- BOWbound5lex.PPMIweight.LENGTHFOC.SOCPOSnav (tsne.50)
- BOWbound5lex.PPMIweight.LENGTHFOC.SOCPOSnav (tsne.30)
- BOWbound10lex.PPMIweight.LENGTHFOC.SOCPOSall (tsne.50)
- BOWbound10lex.PPMIweight.LENGTHFOC.SOCPOSall (tsne.30)
- BOWnobound5lex.PPMIweight.LENGTHFOC.SOCPOSnav (tsne.20)
- BOWnobound5lex.PPMIweight.LENGTHFOC.SOCPOSall (tsne.50)
- BOWnobound5lex.PPMIweight.LENGTHFOC.SOCPOSnav (tsne.50)
- BOWbound10lex.PPMIweight.LENGTHFOC.SOCPOSnav (tsne.20)
- BOWbound10lex.PPMIweight.LENGTHFOC.SOCPOSall (tsne.20)
- BOWbound10lex.PPMIweight.LENGTHFOC.SOCPOSnav (tsne.50)
- BOWnobound10lex.PPMIweight.LENGTHFOC.SOCPOSall (tsne.20)
- BOWbound5lex.PPMIselection.LENGTHFOC.SOCPOSall (tsne.30)
- BOWnobound10lex.PPMIweight.LENGTHFOC.SOCPOSnav (tsne.30)
- BOWnobound10lex.PPMIweight.LENGTHFOC.SOCPOSnav (tsne.20)
- BOWbound5lex.PPMIselection.LENGTHFOC.SOCPOSall (tsne.50)
- BOWbound10lex.PPMIweight.LENGTHFOC.SOCPOSall (tsne.10)
Of these models, the ones with the highest-ranking v-measures across the different methods are BOWnobound10lex.PPMIweight.LENGTHFOC.SOCPOSall (tsne.20) and BOWbound10lex.PPMIweight.LENGTHFOC.SOCPOSall (tsne.10), illustrated in Figure 59.
Figure 59. Best models of ‘huldigen’ according to separability indices and v-measures.
The actual separability indices for these two best models are shown in Table 11.
| level | DR | kNN | SCP | SIL | DR | kNN | SCP | SIL |
|---|---|---|---|---|---|---|---|---|
| globqual | 2.317 | 0.930 | 0.934 | 0.485 | 2.177 | 0.924 | 0.927 | 0.466 |
| meanclassqual | 2.433 | 0.913 | 0.913 | 0.496 | 2.278 | 0.906 | 0.908 | 0.483 |
| classqual | ||||||||
| huldigen_1 | 2.160 | 0.953 | 0.963 | 0.470 | 2.034 | 0.950 | 0.954 | 0.443 |
| huldigen_2 | 2.706 | 0.873 | 0.864 | 0.523 | 2.523 | 0.862 | 0.862 | 0.523 |
More on separability indices
While the “best” models were chosen based on a ranking, the range of the values is very different across lemmas. The curve of the different values by rank for huldigen is shown in Figure 60.
Figure 60. Separability indices of ‘huldigen’ by rank for different measures and levels.
Figure 61. Separability indices of ‘huldigen’ by harmonic mean of ranks for different measures and levels.
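The ranking procedure underlying these plots can be sketched as follows: each model is ranked separately per measure, and the per-measure ranks are then combined with a harmonic mean. A hedged Python illustration with invented scores (modelA, modelB, modelC are hypothetical names; the actual computation runs on the semvar output):

```python
# Sketch of ranking models by the harmonic mean of their per-measure ranks.
# The index values below are invented for illustration.
from statistics import harmonic_mean

scores = {                      # model -> (DR, kNN, SCP, SIL), higher is better
    "modelA": (2.3, 0.93, 0.93, 0.49),
    "modelB": (2.1, 0.92, 0.93, 0.47),
    "modelC": (1.1, 0.50, 0.45, 0.00),
}

def ranks(values):
    """Map each value to its rank (1 = best, i.e. highest; ties share a rank)."""
    order = sorted(values, reverse=True)
    return [order.index(v) + 1 for v in values]

per_measure = list(zip(*scores.values()))                 # one tuple per measure
rank_table = list(zip(*[ranks(m) for m in per_measure]))  # one tuple per model
combined = {m: harmonic_mean(r) for m, r in zip(scores, rank_table)}
best = min(combined, key=combined.get)   # lowest combined rank wins
```

The harmonic mean penalizes models that rank well on some measures but badly on others, which is why the "best" models tend to be decent across the board rather than extreme on a single index.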
HERSTRUCTUREREN
The sample of herstructureren tokens consists of 240 tokens, with the following sense frequency: herstructureren_1: 71, herstructureren_2: 99, herstructureren_3: 70.
Based on visual analysis of the cloud of models, herstructureren belongs to FW (sequential window structure, in this case with the dependency-based models on the side of the smaller window), LSP (the LENGTH:5000 + SOC-POS:all models are separated from the rest), FP (FOC-POS:lex separated from the rest, with FOC-POS:all between it and the dependency-based models, unless LENGTH:5000 + SOC-POS:all), and LR1 (LEMMAREL:group1 is clustered together with LENGTH:5000 + SOC-POS:all). There is no PPMI or SOC-VECTOR structure. The rankings based on v-measures, whether from the nMDS coordinates or from the original procrustes matrix, confirm the FP grouping; the former also confirms FW and the latter suggests BASE[+BOUNDARIES].
Strength of parameters
A first impression of the clouds relates to the stress values of the dimensionality reduction and to the parameters that make the strongest distinctions between models. We have 204 models of herstructureren created on 13/07/2020, modeling between 146 and 240 tokens. The stress value of the MDS solution for the cloud of models is 0.207. The stress values of the MDS solutions of these models range between Inf and -Inf.
As can be seen in Figure 62, there is a lower section with LENGTH:5000 + SOC-POS:all models, with dependency-based models to the right and BOW-based models in a ray streaming to the left, and the rest is split by FOC-POS.
Figure 62. Cloud of models of ‘herstructureren’. Explore it here.
First order filters
Figure 63 and Figure 64 show the quantitative effect of the first order filters on the total number of FOCs and per token respectively. The panels on the left show the data from the BOW-based models, while those on the right show the dependency-based models.
A number of tokens are lost with FOC-WIN:3 + FOC-POS:lex, especially if BOUNDARIES:yes and PPMI:yes (up to 8.33%), but even more with LEMMAREL:group1: 20.42% with PPMI:no and 30.42% with PPMI:yes. LEMMAREL:group2, on the other hand, does not lose that many tokens, even though it keeps fewer FOCs than FOC-WIN:3 + FOC-POS:lex.
Figure 63. Total remaining tokens and context words of ‘herstructureren’.
Figure 64. Remaining context words per token of ‘herstructureren’.
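The token-loss percentages above are plain proportions of tokens left without any context words after filtering. For example, the 20.42% figure for LEMMAREL:group1 + PPMI:no corresponds to 49 of the 240 tokens; a minimal arithmetic sketch (the kept-token count is derived back from the reported percentage):

```python
# Sketch of the token-loss computation: a token is "lost" when a
# first-order filter leaves it without any context words.
# herstructureren has 240 tokens; 191 kept implies 49 lost.
total_tokens = 240
kept_tokens = 191

lost_pct = round(100 * (total_tokens - kept_tokens) / total_tokens, 2)
# lost_pct matches the 20.42% reported for LEMMAREL:group1 + PPMI:no
```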
Notes on the clouds
The concordance of herstructureren is characterized by three senses with very similar frequency: the first two are transitive and apply to different objects, while the third one is intransitive and matches the semantic application of the second sense (companies).
Perhaps in line with these overlapping definitions, the senses tend to overlap, but they also cluster in three groups, each of them excluding one sense: the cluster with predominance of bedrijf excludes the first sense (the one not relating to companies); the one with predominance of worden excludes the intransitive sense; and the one with predominance of om te excludes the second sense, although to a lesser extent.
Visual comparison
In order to compare models along a defined set of parameters, the following groups of clouds were looked at side by side:
- Three groups of 9 models with BOUNDARIES:yes + FOC-WIN:3|10|NA + LENGTH:FOC + SOC-POS:nav, each with a different PPMI, in order to compare across different FOC-POS and BASE but also assess the effect of PPMI.
Distances span between 0.4 and 0.8 for pairs with some BOW-based model, but lower for those with same FOC-WIN, as well as for LEMMAPATH pairs.
In nMDS solutions, clouds mostly have a dense core with a halo and a lot of overlap between the senses. The t-SNE solutions, especially with perplexity 20, do show three main clusters (and sometimes a couple more), but they don’t really match the senses; higher perplexities are just more dispersed versions of the same topology.
The clearest small clusters are one with ‘schuld’ as co-occurrence (mostly herstructureren_1) in PPMI:weight|selection, one with ‘grondig’ in PPMI:weight, and one with aan in most FOC-WIN:all | LEMMAPATH models. The most interesting organization, however, is the three main clusters, which take up a triangular formation in LEMMAREL:group2 and a line in LEMMAPATH. The groups seem to be characterized by the co-occurrence of ‘om’+‘te’, ‘worden’ and ‘worden’+‘om’+‘bedrijf’. Rather than exhibiting a predominant sense, they seem to repel senses: ‘om’+‘te’ seems to repel herstructureren_2, ‘worden’ repels herstructureren_3, and ‘worden’+‘om’+‘bedrijf’ repels herstructureren_1 (fascinating, right?).
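The t-SNE solutions compared above were generated at several perplexities; the step can be sketched with scikit-learn, with random stand-in data instead of the real token distance matrix (the report's own pipeline is in R):

```python
# Sketch: embed token vectors with t-SNE at several perplexities.
# Random data stands in for the real token representations.
import numpy as np
from sklearn.manifold import TSNE

rng = np.random.default_rng(0)
tokens = rng.normal(size=(60, 20))     # 60 stand-in tokens, 20 dims

solutions = {}
for perplexity in (10, 20, 30):
    # t-SNE implementations bound perplexity by the token count
    # (e.g. Rtsne requires 3*perplexity < n), which is presumably why
    # some smaller models have no perplexity-50 solution.
    tsne = TSNE(n_components=2, perplexity=perplexity, random_state=0)
    solutions[perplexity] = tsne.fit_transform(tokens)
```

Lower perplexities favour small, tight clusters (like the collocate-driven ones described above), while higher perplexities spread the same topology out.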
Clustering and separability measures
One other way of selecting the “best” models, or of quantifying their quality, is to use separability indices (from the semvar package). The four main indices (“DR”, “SIL”, “SCP” with k=10, and “kNN” with k=10) have been computed and their values (global quality, quality by class and mean class quality) have been ranked. There is hardly any overlap between the top 10% of the different measures: the only overlap is between the top 10% of all values of “kNN” and “SCP”. The top 10% by harmonic mean of ranks includes only the following 15 solutions:
- BOWnobound10all.PPMIweight.LENGTHFOC.SOCPOSall (tsne.20)
- BOWnobound10all.PPMIweight.LENGTHFOC.SOCPOSnav (tsne.20)
- LEMMAPATHselection2.PPMIselection.LENGTHFOC.SOCPOSall (mds)
- LEMMAPATHweight.PPMIselection.LENGTH5000.SOCPOSnav (tsne.30)
- LEMMAPATHweight.PPMIselection.LENGTH5000.SOCPOSnav (tsne.10)
- LEMMAPATHweight.PPMIselection.LENGTH5000.SOCPOSnav (tsne.50)
- LEMMAPATHselection2.PPMIno.LENGTHFOC.SOCPOSnav (mds)
- LEMMAPATHselection2.PPMIselection.LENGTHFOC.SOCPOSnav (mds)
- LEMMAPATHselection2.PPMIno.LENGTHFOC.SOCPOSall (mds)
- LEMMAPATHselection2.PPMIweight.LENGTHFOC.SOCPOSall (mds)
- LEMMAPATHweight.PPMIselection.LENGTH5000.SOCPOSnav (tsne.20)
- LEMMAREL2.PPMIselection.LENGTHFOC.SOCPOSnav (mds)
- LEMMAREL2.PPMIselection.LENGTHFOC.SOCPOSall (tsne.50)
- LEMMAREL2.PPMIno.LENGTHFOC.SOCPOSnav (mds)
- LEMMAPATHselection2.PPMIweight.LENGTHFOC.SOCPOSnav (mds)
Of these models, the ones with the highest-ranking v-measures across the different methods are LEMMAREL2.PPMIselection.LENGTHFOC.SOCPOSall (tsne.50) and LEMMAPATHweight.PPMIselection.LENGTH5000.SOCPOSnav (tsne.50), illustrated in Figure 65.
Figure 65. Best models of ‘herstructureren’ according to separability indices and v-measures.
The actual separability indices for these two best models are shown in Table 12.
| level | DR | kNN | SCP | SIL | DR | kNN | SCP | SIL |
|---|---|---|---|---|---|---|---|---|
| globqual | 1.108 | 0.498 | 0.454 | 0.004 | 1.037 | 0.465 | 0.415 | -0.038 |
| meanclassqual | 1.083 | 0.466 | 0.417 | -0.018 | 1.045 | 0.455 | 0.402 | -0.031 |
| classqual | ||||||||
| herstructureren_1 | 0.966 | 0.456 | 0.394 | -0.103 | 0.996 | 0.361 | 0.278 | -0.113 |
| herstructureren_2 | 1.246 | 0.606 | 0.587 | 0.118 | 1.024 | 0.545 | 0.513 | -0.031 |
| herstructureren_3 | 1.038 | 0.336 | 0.271 | -0.068 | 1.115 | 0.458 | 0.416 | 0.050 |
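Of the four indices, the kNN-based one is the most transparent: for each token, the proportion of its k nearest neighbours that share its sense, averaged over tokens. A Python sketch of this general idea on toy two-sense data (the actual values come from the semvar package, whose implementation may differ in details):

```python
# Sketch of a kNN-style separability score: the mean proportion of each
# token's k nearest neighbours that carry the same sense label.
import numpy as np

def knn_separability(coords, labels, k=10):
    """Mean proportion of each point's k nearest neighbours with its label."""
    labels = np.asarray(labels)
    diff = coords[:, None, :] - coords[None, :, :]
    dist = np.sqrt((diff ** 2).sum(-1))      # pairwise Euclidean distances
    np.fill_diagonal(dist, np.inf)           # exclude each point itself
    nearest = np.argsort(dist, axis=1)[:, :k]
    same = labels[nearest] == labels[:, None]
    return same.mean()

rng = np.random.default_rng(1)
cloud = np.vstack([rng.normal(0, 0.5, (30, 2)),    # sense A, around (0, 0)
                   rng.normal(8, 0.5, (30, 2))])   # sense B, around (8, 8)
senses = ["a"] * 30 + ["b"] * 30
score = knn_separability(cloud, senses, k=10)      # close to 1 for clean clusters
```

Two cleanly separated senses score near 1; the values around 0.3-0.6 in Table 12 reflect how much the herstructureren senses interpenetrate.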
More on separability indices
While the “best” models were chosen based on a ranking, the range of the values is very different across lemmas. The curve of the different values by rank for herstructureren is shown in Figure 66.
Figure 66. Separability indices of ‘herstructureren’ by rank for different measures and levels.
Figure 67. Separability indices of ‘herstructureren’ by harmonic mean of ranks for different measures and levels.
HERSTELLEN
The sample of herstellen tokens consists of 239 tokens, with the following sense frequency: herstellen_1: 32, herstellen_2: 90, herstellen_3: 15, herstellen_4: 38, herstellen_5: 57, herstellen_6: 7.
Based on visual analysis of the cloud of models, herstellen belongs to FP (FOC-POS:lex separated from the rest, with FOC-POS:all slightly closer to it than the dependency-based models), LSP (the LENGTH:5000 + SOC-POS:all models are separated from the rest), and LR1 (LEMMAREL:group1 is separated from the rest of the models, although split by SOC-VECTOR). There is no clear structure based on FOC-WIN or PPMI. The ranking based on v-measures from the nMDS solutions confirms a SOC-VECTOR grouping and suggests FOC-WIN, while the one from the original procrustes matrix suggests BASE+BOUNDARIES.
Strength of parameters
A first impression of the clouds relates to the stress values of the dimensionality reduction and to the parameters that make the strongest distinctions between models. We have 204 models of herstellen created on 13/07/2020, modeling between 210 and 239 tokens. The stress value of the MDS solution for the cloud of models is 0.206. The stress values of the MDS solutions of these models range between Inf and -Inf.
As can be seen in Figure 68, the models are split orthogonally by FOC-POS (across the horizontal axis) and SOC-VECTOR; the rightmost models are LEMMAREL:group1.
Figure 68. Cloud of models of ‘herstellen’. Explore it here.
First order filters
Figure 69 and Figure 70 show the quantitative effect of the first order filters on the total number of FOCs and per token respectively. The panels on the left show the data from the BOW-based models, while those on the right show the dependency-based models.
A number of tokens are lost with the strictest BOW-based combinations, although without much difference between BOUNDARIES values, and with LEMMAREL:group1 (up to 12.13%).
Figure 69. Total remaining tokens and context words of ‘herstellen’.
Figure 70. Remaining context words per token of ‘herstellen’.
Notes on the clouds
The concordance of herstellen is characterized by six senses with different frequencies and presenting different combinations of semantic fields of their objects and argument structure.
A few lexical collocates (mostly in ere, but also evenwicht) and constructions (e.g. zijn van) can pull groups of tokens together and cluster them separately in t-SNE solutions. Irrespective of them, the four main senses keep a rather good separation in most models; of the two infrequent ones, the one with a more specific application (financial entities) always overlaps with the same main sense (with a similar application), while the vaguer one is spread all over the place.
Visual comparison
In order to compare models along a defined set of parameters, the following groups of clouds were looked at side by side:
- Three groups of 9 models with BOUNDARIES:yes + FOC-WIN:3|10|NA + LENGTH:FOC + SOC-POS:nav, each with a different PPMI, in order to compare across different FOC-POS and BASE but also assess the effect of PPMI.
- Three groups of 8 models with (BOUNDARIES:yes + FOC-WIN:10 | LEMMAPATH) as first order alternatives and (LENGTH:5000 + SOC-POS:all) | (LENGTH:FOC + SOC-POS:nav) as second order alternatives, each with a different PPMI, in order to compare across different FOC-POS and SOC-VECTOR but also assess the effect of PPMI.
Distances span between 0.1 and 0.7, highest for pairs with LEMMAREL1 and lowest for pairs with PATH; distances between LEMMAREL models decrease with PPMI:no, while distances between pairs with one FOC-POS:lex increase. In the second group, distances between models with different SOC-VECTOR or LENGTH:5000 + SOC-POS:all + PPMI:selection|no are around 0.5 or higher, while the rest (and particularly, all LEMMAPATH pairs) are lower: SOC-VECTOR can make a greater difference than BASE.
The LEMMAREL:group1 models perform consistently worse than the others, with much more overlap and less clear clusters, so they will be excluded from the description below.
In the nMDS solutions there is a clear cluster of tokens joined by the occurrence of ‘in’+‘ere’ (herstellen_2) and a couple of outliers that can really distort the picture, especially in PPMI:selection and/or LENGTH:5000 + SOC-POS:all (the most powerful one only has one context word in those models, namely ‘naaien’). The infrequent sense herstellen_6 (intransitive, for financial entities) always overlaps with or is included within herstellen_4 (reflexive; often with “economie” or similar entities as subject), so they will be spoken of as one. If PPMI:weight and/or LENGTH:FOC + SOC-POS:nav, all senses except herstellen_3 (transitive, referring to mistakes; quite infrequent) are relatively well separated from each other, taking up different areas in the cloud, and the ‘in ere’ cluster hovers close to the major herstellen_2 group.
In t-SNE solutions, perplexity 20 strengthens the organization in PPMI:weight models, but does not improve the rest that much, and higher perplexities only increase dispersion. The ‘in ere’ cluster is always well defined, but not necessarily in the vicinity of herstellen_2 (that may be random, although it does occur more often with perplexity 20 than with 10). PPMI:weight models add a further cluster marked by the occurrence of ‘evenwicht’ (also herstellen_2), best separated if FOC-WIN:10 + FOC-POS:all | LEMMAPATH:selection3, while PPMI:selection|no clusters most of herstellen_5 by the co-occurrence of ‘van’+‘zijn’ in the same models.
Other than these clusters, the rest of the tokens form a wider mass, with the main senses taking up specific areas. This may allow for a meaningful organization, such as the herstellen_4 group between the transitive senses on one side and herstellen_5 (intransitive, “to heal”) on the other in (FOC-WIN:10 + FOC-POS:all | LEMMAPATH:selection3) + PPMI:weight with perplexity 20.
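The nMDS solutions discussed above can be generated along these lines; a scikit-learn sketch of the non-metric MDS step, with toy distances standing in for the cosine distance matrix between tokens (the report's own pipeline is in R):

```python
# Sketch: non-metric MDS on a precomputed distance matrix, as used for
# the 2D token clouds. Toy Euclidean distances stand in for the real
# cosine distances between token vectors.
import numpy as np
from sklearn.manifold import MDS

rng = np.random.default_rng(2)
points = rng.normal(size=(40, 5))
dist = np.sqrt(((points[:, None] - points[None, :]) ** 2).sum(-1))

nmds = MDS(n_components=2, metric=False, dissimilarity="precomputed",
           random_state=0)
coords = nmds.fit_transform(dist)
# nmds.stress_ is the stress value reported for each solution
```

Non-metric MDS only tries to preserve the rank order of the distances, which is why its solutions tend to show a dense core with a halo rather than the sharply separated clusters of t-SNE.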
Clustering and separability measures
One other way of selecting the “best” models, or of quantifying their quality, is to use separability indices (from the semvar package). The four main indices (“DR”, “SIL”, “SCP” with k=10, and “kNN” with k=10) have been computed and their values (global quality, quality by class and mean class quality) have been ranked. The top 10% by harmonic mean of ranks includes 46 solutions; there is hardly any overlap between the top 10% of the different measures: the only overlap is between the top 10% of all values of “kNN” and “SCP”.
Of these models, the ones with the highest-ranking v-measures across the different methods are LEMMAPATHselection3.PPMIweight.LENGTH5000.SOCPOSnav (tsne.10) and BOWbound10all.PPMIweight.LENGTH5000.SOCPOSnav (tsne.20), illustrated in Figure 71.
Figure 71. Best models of ‘herstellen’ according to separability indices and v-measures.
The actual separability indices for these two best models are shown in Table 13.
| level | DR | kNN | SCP | SIL | DR | kNN | SCP | SIL |
|---|---|---|---|---|---|---|---|---|
| globqual | 2.043 | 0.698 | 0.662 | 0.036 | 1.851 | 0.694 | 0.639 | 0.020 |
| meanclassqual | 2.253 | 0.624 | 0.567 | 0.099 | 2.060 | 0.630 | 0.557 | 0.088 |
| classqual | ||||||||
| herstellen_1 | 2.053 | 0.454 | 0.442 | 0.253 | 2.350 | 0.565 | 0.509 | 0.339 |
| herstellen_2 | 1.196 | 0.760 | 0.726 | -0.187 | 1.022 | 0.748 | 0.696 | -0.255 |
| herstellen_3 | 1.372 | 0.302 | 0.086 | -0.131 | 1.302 | 0.290 | 0.075 | -0.178 |
| herstellen_4 | 5.190 | 0.920 | 0.926 | 0.660 | 4.095 | 0.886 | 0.924 | 0.580 |
| herstellen_5 | 1.452 | 0.685 | 0.655 | -0.099 | 1.530 | 0.661 | 0.583 | -0.046 |
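The silhouette index (“SIL”) is the most standard of the four: it compares each token's distance to its own sense with its distance to the nearest other sense, yielding values in [-1, 1]. A scikit-learn sketch on toy data (the semvar implementation may differ in details):

```python
# Sketch of the silhouette index ("SIL"): cohesion vs. separation of the
# sense classes. Toy two-sense data stands in for a real model.
import numpy as np
from sklearn.metrics import silhouette_score

rng = np.random.default_rng(3)
coords = np.vstack([rng.normal(0, 1, (25, 2)),    # sense 1, around (0, 0)
                    rng.normal(6, 1, (25, 2))])   # sense 2, around (6, 6)
senses = [1] * 25 + [2] * 25

sil = silhouette_score(coords, senses)
# well-separated senses give values close to 1; values around 0, as in
# Table 13, indicate heavily overlapping senses
```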
More on separability indices
While the “best” models were chosen based on a ranking, the range of the values is very different across lemmas. The curve of the different values by rank for herstellen is shown in Figure 72. DR values were ignored for 2 solutions because they were too high (larger than 10) in some levels (globqual, meanclassqual, classqual_herstellen_2, classqual_herstellen_3, classqual_herstellen_4, classqual_herstellen_5), reaching up to 682.1.
Figure 72. Separability indices of ‘herstellen’ by rank for different measures and levels.
Figure 73. Separability indices of ‘herstellen’ by harmonic mean of ranks for different measures and levels.
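The exclusion of extreme DR values described above amounts to simple masking before plotting and ranking; a Python sketch with invented solution names and values (only the 682.1 maximum comes from the text):

```python
# Sketch: drop solutions whose DR value exceeds 10 at any level, as was
# done for the two excluded herstellen solutions. Values are invented.
dr_values = {                     # solution -> DR per level
    "solA": [2.0, 2.3, 1.9],
    "solB": [682.1, 3.0, 2.1],    # extreme value -> excluded
    "solC": [1.8, 1.7, 2.2],
}
kept = {s: v for s, v in dr_values.items() if max(v) <= 10}
```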
While some features four steps away can be interesting, such as passive subjects of a verb with two modals, they are not that frequent and may not be worth the noise introduced by accepting all features with so many steps between them and the target. To catch those relationships, LEMMAREL is a more efficient method.
“median” and “centroid” were also used, but they consistently failed: the cutree() function, meant to split the resulting dendrogram into the same number of branches as required by each parameter, consistently returned more splits than requested.
A contextual synonym that occurs in the sample is toast: it occurs only once in a 4-4 window of the target, and 6 times in the 10-10 window. Its PPMI based on the former is 3.15. Some models fit it perfectly next to the rest of glas, while dependency-based models seem to lose it (they do not select toast as a context word).
Only tokens from classes with an absolute frequency above 10 have been included in the computations, that is, only harden_3 and harden_5.
However, my notes don’t comment on the good clustering of this sense. I should check it out.